Model selection for Poisson processes



Abstract: Our purpose in this paper is to apply the general methodology 
for model selection based on T-estimators developed in Birge [Ann. Inst. H. 
Poincare Probab. Statist. 42 (2006) 273-325] to the particular situation of 
the estimation of the unknown mean measure of a Poisson process. We in- 
troduce a Hellinger type distance between finite positive measures to serve as 
our loss function and we build suitable tests between balls (with respect to 
this distance) in the set of mean measures. As a consequence of the existence 
of such tests, given a suitable family of approximating models, we can build 
T-estimators for the mean measure based on this family of models and analyze 
their performances. We provide a number of applications to adaptive intensity 
estimation when the square root of the intensity belongs to various smoothness 
classes. We also give a method for aggregation of preliminary estimators. 



1. Introduction 

This paper deals with the estimation of the mean measure $\mu$ of a Poisson process $X$ on $\mathcal{X}$. More precisely, we develop a theoretical, but quite general method for estimating $\mu$ by model selection with applications to adaptive estimation and aggregation of preliminary estimators. The main advantage of the method is its generality. We do not make any assumption on $\mu$ apart from the fact that it should be finite and we allow arbitrary countable families of models provided that each model be of finite metric dimension, i.e. is not too large in a suitable sense to be explained below. We do not know of any other estimation method that handles model selection in such generality and with as few assumptions. The main drawback of the method is its theoretical nature: computing the estimators is typically too costly to permit a practical implementation. In order to give a more precise idea of what this paper is about, we need to start by recalling a few well-known facts about Poisson processes that can, for instance, be found in Reiss [29].

1.1. The basics of Poisson processes 

Let us denote by $\mathcal{Q}_+(\mathcal{X})$ the cone of finite positive measures on the measurable space $(\mathcal{X}, \mathcal{E})$. Given an element $\mu \in \mathcal{Q}_+(\mathcal{X})$, a Poisson process on $\mathcal{X}$ with mean measure $\mu$ is a point process $X = \{X_1, \dots, X_N\}$ on $\mathcal{X}$ such that $N$ has a Poisson distribution with parameter $\mu(\mathcal{X})$ and, conditionally on $N$, the $X_i$ are i.i.d. with distribution $\mu_1 = \mu/\mu(\mathcal{X})$. Equivalently, the Poisson process can be viewed as a random measure $\Lambda_X = \sum_{i=1}^N \delta_{X_i}$, $\delta_x$ denoting the Dirac measure concentrated at the point $x$. Then, whatever the partition $A_1, \dots, A_n$ of $\mathcal{X}$, the $n$ random variables $\Lambda_X(A_i)$ are independent with Poisson distributions and respective parameters $\mu(A_i)$ and this property characterizes a Poisson process. We shall denote by $Q_\mu$ the distribution of a Poisson process with mean measure $\mu$ on $\mathcal{X}$. We recall that, for any nonnegative measurable function $\phi$ on $(\mathcal{X}, \mathcal{E})$,



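$$\mathbb{E}\Big[\sum_{i=1}^N \phi(X_i)\Big] = \int_{\mathcal{X}} \phi(x)\,d\mu(x) \tag{1.1}$$

and

$$\mathbb{E}\Big[\prod_{i=1}^N \phi(X_i)\Big] = \exp\Big[\int_{\mathcal{X}} [\phi(x) - 1]\,d\mu(x)\Big]. \tag{1.2}$$

If $\mu, \nu \in \mathcal{Q}_+(\mathcal{X})$ and $\mu \ll \nu$, then $Q_\mu \ll Q_\nu$ and

$$\frac{dQ_\mu}{dQ_\nu}(X_1, \dots, X_N) = \exp[\nu(\mathcal{X}) - \mu(\mathcal{X})] \prod_{i=1}^N \frac{d\mu}{d\nu}(X_i), \tag{1.3}$$

with the convention that $\prod_{i=1}^0 (d\mu/d\nu)(X_i) = 1$.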
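The two-stage description of a Poisson process (draw $N$, then draw $N$ i.i.d. points) can be sketched in code. The following is a minimal illustration, not part of the original paper: it assumes $\mathcal{X} = [0,1]$ with $\mu$ equal to `total_mass` times the uniform distribution, and the helper names (`sample_poisson`, `sample_poisson_process`, `campbell_check`) are hypothetical. It compares a Monte Carlo average of $\sum_i \phi(X_i)$ with $\int \phi\,d\mu$, the standard mean formula for Poisson processes.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw N ~ Poisson(lam) with Knuth's multiplication method (fine for moderate lam)."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count

def sample_poisson_process(total_mass, draw_point, rng):
    """Return the points X_1..X_N: N ~ Poisson(mu(X)), then N i.i.d. draws from mu_1."""
    n = sample_poisson(total_mass, rng)
    return [draw_point(rng) for _ in range(n)]

def campbell_check(total_mass=5.0, n_rep=4000, seed=0):
    """Monte Carlo average of sum_i phi(X_i) versus the integral of phi against mu."""
    rng = random.Random(seed)
    phi = lambda x: x * x            # test function on [0, 1] (illustrative choice)
    draw = lambda r: r.random()      # mu_1 = uniform distribution on [0, 1] (assumption)
    sums = [sum(phi(x) for x in sample_poisson_process(total_mass, draw, rng))
            for _ in range(n_rep)]
    empirical = sum(sums) / n_rep
    exact = total_mass / 3.0         # integral of total_mass * x^2 over [0, 1]
    return empirical, exact
```

Calling `campbell_check()` returns the empirical and exact values, which should be close for a few thousand replications.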



1.2. Introducing our loss function 

From now on, we assume that we observe a Poisson process $X$ on $\mathcal{X}$ with unknown mean measure $\mu \in \mathcal{Q}_+(\mathcal{X})$ so that $\mu$ always denotes the parameter to be estimated. For this, we use estimators $\hat\mu(X)$ with values in $\mathcal{Q}_+(\mathcal{X})$ and measure their performance via the loss function $H^q(\hat\mu(X), \mu)$ for $q \ge 1$, where $H$ is a suitable distance on $\mathcal{Q}_+(\mathcal{X})$. To motivate its introduction, let us recall some known facts. The Hellinger distance $h$ between two probabilities $P$ and $Q$ defined on the same space and their Hellinger affinity $\rho$ are given respectively by

$$h^2(P,Q) = \frac{1}{2}\int \big(\sqrt{dP} - \sqrt{dQ}\big)^2, \qquad \rho(P,Q) = \int \sqrt{dP\,dQ} = 1 - h^2(P,Q), \tag{1.4}$$

where $dP$ and $dQ$ denote the densities of $P$ and $Q$ with respect to any dominating measure, the result being independent of the choice of such a measure. If $X_1, \dots, X_n$ are i.i.d. with distribution $P$ on $\mathcal{X}$ and $Q$ is another distribution, it follows from an exponential inequality that, for all $x \in \mathbb{R}$,

$$\mathbb{P}\Big[\sum_{i=1}^n \log\frac{dQ}{dP}(X_i) \ge 2x\Big] \le \exp[n\log(\rho(P,Q)) - x] \le \exp[-nh^2(P,Q) - x], \tag{1.5}$$



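which provides an upper bound for the errors of likelihood ratio tests. In particular, if $\mu$ and $\mu'$ are two elements in $\mathcal{Q}_+(\mathcal{X})$ dominated by some measure $\lambda$, it follows from (1.3) and (1.2) that the Hellinger affinity $\rho(Q_\mu, Q_{\mu'})$ between $Q_\mu$ and $Q_{\mu'}$ is given by

$$\rho(Q_\mu, Q_{\mu'}) = \int \sqrt{\frac{dQ_\mu}{dQ_\lambda}\,\frac{dQ_{\mu'}}{dQ_\lambda}}\;dQ_\lambda = \exp[-H^2(\mu,\mu')], \tag{1.6}$$

where

$$H^2(\mu,\mu') = \frac{1}{2}[\mu(\mathcal{X}) + \mu'(\mathcal{X})] - \int \sqrt{(d\mu/d\lambda)(d\mu'/d\lambda)}\,d\lambda \tag{1.7}$$

$$\phantom{H^2(\mu,\mu')} = \frac{1}{2}\int \Big(\sqrt{d\mu/d\lambda} - \sqrt{d\mu'/d\lambda}\Big)^2 d\lambda. \tag{1.8}$$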
Comparing (1.8) with (1.4) indicates that $H$ is merely the generalization of the Hellinger distance $h$ between probabilities to arbitrary finite positive measures and the introduction of $H$ turns $\mathcal{Q}_+(\mathcal{X})$ into a metric space. Moreover, we derive from (1.5) with $n = 1$ that, when $X$ is a Poisson process with mean measure $\mu$ on $\mathcal{X}$,

$$\mathbb{P}\Big[\log\frac{dQ_{\mu'}}{dQ_\mu}(X) \ge 2x\Big] \le \exp[-H^2(\mu,\mu') - x] \quad\text{for all } x \in \mathbb{R}. \tag{1.9}$$

If $\mu(\mathcal{X}) = \mu'(\mathcal{X}) = n$, then $H^2(\mu,\mu') = nh^2(\mu_1, \mu'_1)$ and (1.9) becomes a perfect analogue of (1.5). The fact that the errors of likelihood ratio tests between two probabilities are controlled by their Hellinger affinity justifies the introduction of the Hellinger distance as the natural loss function for density estimation, as shown by Le Cam [26]. It also motivates the choice of $H^q$ as a natural loss function for estimating the mean measure of a Poisson process. For simplicity, we shall first focus on the quadratic risk $\mathbb{E}[H^2(\hat\mu(X), \mu)]$.
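The identity $\rho(Q_\mu, Q_{\mu'}) = \exp[-H^2(\mu,\mu')]$ can be checked numerically in the simplest setting. The sketch below is an illustration under explicit assumptions: $\mathcal{X}$ is a finite set with the counting measure, so that a mean measure is a vector of strictly positive Poisson parameters and $Q_\mu$ is a product of Poisson laws. The function names are hypothetical and the series for the affinity is truncated at `terms` summands.

```python
import math

def hellinger_sq(mu, nu):
    """H^2(mu, nu) = (1/2) * sum_i (sqrt(mu_i) - sqrt(nu_i))^2 for measures on a finite set."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(mu, nu))

def poisson_affinity(a, b, terms=200):
    """Hellinger affinity sum_k sqrt(p_a(k) p_b(k)) between Poisson(a) and Poisson(b), a, b > 0."""
    total = 0.0
    for k in range(terms):
        log_pa = -a + k * math.log(a) - math.lgamma(k + 1)
        log_pb = -b + k * math.log(b) - math.lgamma(k + 1)
        total += math.exp(0.5 * (log_pa + log_pb))
    return total

def process_affinity(mu, nu):
    """Affinity between the two product laws: product of the coordinatewise affinities."""
    result = 1.0
    for a, b in zip(mu, nu):
        result *= poisson_affinity(a, b)
    return result

mu = [1.0, 2.5, 0.3]
nu = [0.5, 4.0, 0.3]
gap = abs(process_affinity(mu, nu) - math.exp(-hellinger_sq(mu, nu)))
```

For moderate parameters such as those above, `gap` is at the level of floating-point rounding, in agreement with (1.6).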

1.3. Intensity estimation 

A case of particular interest occurs when we have at hand a reference positive measure $\lambda$ on $\mathcal{X}$ and we assume that $\mu \ll \lambda$ with $d\mu/d\lambda = s$, in which case $s$ is called the intensity (with respect to $\lambda$) of the process with mean measure $\mu$. Denoting by $L_i^+(\lambda)$ the positive part of $L_i(\lambda)$ for $i = 1, 2$, we observe that $s \in L_1^+(\lambda)$, $\sqrt{s} \in L_2^+(\lambda)$ and $\mu \in \mathcal{Q}_\lambda = \{\mu_t = t\cdot\lambda,\ t \in L_1^+(\lambda)\}$. The one-to-one correspondence $t \mapsto \mu_t$ between $L_1^+(\lambda)$ and $\mathcal{Q}_\lambda$ allows us to transfer the distance $H$ to $L_1^+(\lambda)$ which gives, by (1.8),

$$H^2(s,t) = H^2(\mu_s, \mu_t) = \frac{1}{2}\big\|\sqrt{s} - \sqrt{t}\big\|_2^2, \tag{1.10}$$

where $\|\cdot\|_2$ stands for the norm in $L_2(\lambda)$. When $\mu = \mu_s \in \mathcal{Q}_\lambda$ it is natural to estimate it by some element $\hat\mu(X) = \hat s(X)\cdot\lambda$ of $\mathcal{Q}_\lambda$, in which case $H(\hat\mu(X), \mu) = H(\hat s(X), s)$ and our problem can be viewed as a problem of intensity estimation: design an estimator $\hat s(X) \in L_1^+(\lambda)$ for the unknown intensity $s$. From now on, given a Poisson process $X$ with mean measure $\mu$, we shall denote by $\mathbb{E}_\mu$ and $\mathbb{P}_\mu$ (or $\mathbb{E}_s$ and $\mathbb{P}_s$ when $\mu = \mu_s$) the expectations of functions of $X$ and probabilities of events depending on $X$, respectively.

1.4. Model based estimation and model selection

It is common practice to try to estimate the intensity $s$ on $\mathcal{X}$ by a piecewise constant function, i.e. a histogram estimator $\hat s(X)$ belonging to the set

$$S_m = \Big\{\sum_{j=1}^D a_j 1_{I_j},\ a_j \ge 0 \text{ for } 1 \le j \le D\Big\}$$

of nonnegative piecewise constant functions with respect to the partition $\{I_1, \dots, I_D\} = m$ of $\mathcal{X}$ with $\lambda(I_j) > 0$ for all $j$. More generally, given a finite family $m = \{\varphi_1, \dots, \varphi_D\}$ of elements of $L_2(\lambda)$, we may consider the $D$-dimensional linear space $S_m$ generated by the $\varphi_j$ and try to estimate $\sqrt{s}$ by some element $\sqrt{\hat s(X)} \in S_m$. This clearly leads to difficulties since $S_m$ is not a subset of $L_2^+(\lambda)$, but we shall nevertheless show that it is possible to design an estimator $\hat s_m(X)$ with the property that

$$\mathbb{E}_s\big[H^2(\hat s_m(X), s)\big] \le C\Big[\inf_{t\in S_m}\big\|t - \sqrt{s}\big\|_2^2 + |m|\Big], \tag{1.11}$$

where $|m| = D$ stands for the cardinality of $m$ and $C$ is a universal constant. In this approach, $S_m$ should be viewed as a model for $\sqrt{s}$, which means an approximating set since we never assume that $\sqrt{s} \in S_m$, and the risk bound (1.11) has (up to the constant $C$) the classical structure of the sum of an approximation term $\inf_{t\in S_m}\|t - \sqrt{s}\|_2^2$ and an estimation term $|m|$ corresponding to the number of parameters to be estimated.
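As a concrete instance of the histogram model at the start of this subsection, here is a minimal sketch of the usual histogram estimate of an intensity on an interval with respect to the Lebesgue measure: on each cell, the number of observed points divided by the cell length. This is the standard histogram rule, used here as an assumption for illustration; it is not the T-estimator constructed in this paper, and the function name and arguments are hypothetical.

```python
from bisect import bisect_right

def histogram_intensity(points, breaks):
    """Piecewise constant estimate on [breaks[0], breaks[-1]): count / cell length per cell.

    `breaks` is an increasing list of cell boundaries; cells are [breaks[j], breaks[j+1]).
    """
    counts = [0] * (len(breaks) - 1)
    for x in points:
        if breaks[0] <= x < breaks[-1]:
            counts[bisect_right(breaks, x) - 1] += 1
    return [c / (hi - lo) for c, lo, hi in zip(counts, breaks[:-1], breaks[1:])]

# Three observed points, partition of [0, 1) into two cells of length 1/2.
estimate = histogram_intensity([0.1, 0.2, 0.7], [0.0, 0.5, 1.0])
```

The estimate integrates to the total number of observed points, which is the natural estimate of $\mu(\mathcal{X})$.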

If we introduce a countable (here countable always means finite or countable) family of models $\{S_m, m \in \mathcal{M}\}$ of the previous form, we would like to know to what extent it is possible to build a new estimator $\hat s(X)$ such that

$$\mathbb{E}_s\big[H^2(\hat s(X), s)\big] \le C' \inf_{m\in\mathcal{M}}\Big\{\inf_{t\in S_m}\big\|t - \sqrt{s}\big\|_2^2 + |m|\Big\}, \tag{1.12}$$

for some other constant $C'$, i.e. to know whether one can design an estimator which realizes, up to some constant, the best compromise between the two components of the risk bound (1.11). The problem of understanding to what extent (1.12) does hold has been treated in many papers using various methods, mostly based on the minimization of some penalized criterion. A special construction based on testing has been introduced in Birgé [9] and then applied to different stochastic frameworks. We shall show here that this construction also applies to Poisson processes and then derive the numerous consequences of this property. We shall, in particular, be able to prove the following result in Section 3.4.1 below.

Theorem 1. Let $\lambda$ be some positive measure on $\mathcal{X}$ and $\|\cdot\|_2$ denote the norm in $L_2(\lambda)$. Let $\{S_m\}_{m\in\mathcal{M}}$ be a finite or countable family of linear subspaces of $L_2(\lambda)$ with respective finite dimensions $D_m$ and let $\{\Delta_m\}_{m\in\mathcal{M}}$ be a family of nonnegative weights satisfying

$$\sum_{m\in\mathcal{M}} \exp[-\Delta_m] \le \Sigma < +\infty. \tag{1.13}$$

Let $X$ be a Poisson process on $\mathcal{X}$ with unknown mean measure $\mu = \mu_s + \mu^\perp$ where $s \in L_1^+(\lambda)$ and $\mu^\perp$ is orthogonal to $\lambda$. One can build an estimator $\hat\mu = \hat\mu(X) = \hat s(X)\cdot\lambda \in \mathcal{Q}_\lambda$ satisfying, for all $\mu \in \mathcal{Q}_+(\mathcal{X})$ and $q \ge 1$,

$$\mathbb{E}_\mu\big[H^q(\mu, \hat\mu)\big] \le C(q)[1 + \Sigma] \inf_{m\in\mathcal{M}}\Big\{\inf_{t\in S_m}\big\|\sqrt{s} - t\big\|_2 + \sqrt{D_m \vee \Delta_m} + \sqrt{\mu^\perp(\mathcal{X})}\Big\}^q, \tag{1.14}$$

with a constant $C(q)$ depending on $q$ only.

When $\mu = \mu_s \in \mathcal{Q}_\lambda$, (1.14) becomes

$$\mathbb{E}_s\big[H^q(s, \hat s)\big] \le C(q)[1 + \Sigma] \inf_{m\in\mathcal{M}}\Big\{\inf_{t\in S_m}\big\|\sqrt{s} - t\big\|_2 + \sqrt{D_m \vee \Delta_m}\Big\}^q. \tag{1.15}$$

Typical examples for $\mathcal{X}$ and $\lambda$ are $[0,1]^k$ with the Lebesgue measure or $\{1; \dots; n\}$ with the counting measure. In this last case, the $n$ random variables $\Lambda_X(\{i\}) = N_i$ are independent Poisson variables with respective parameters $s_i = s(i)$ and observing $X$ is equivalent to observing a set of $n$ independent Poisson variables with varying parameters, a framework which is usually studied under the name of Poisson regression.



1.5. Model selection for Poisson processes, a brief review 

Although there have been numerous papers devoted to estimation of the mean measure of a Poisson process, only a few, recently, considered the problem of model selection, the key reference being Reynaud-Bouret [30] with extensions to more general processes in Reynaud-Bouret [31]. A major difference with our approach is her use of the $L_2(\lambda)$-loss, instead of the Hellinger type loss that we introduce here. It first requires that the unknown mean measure $\mu$ be dominated by $\lambda$ with intensity $s$ and that $s \in L_2(\lambda)$. Moreover, as we shall show in Section 2.3, the use of the $L_2$-loss typically requires that $s \in L_\infty(\lambda)$. This results in rather complicated assumptions but the advantage of this approach is that it is based on penalized projection estimators which can be computed practically while the construction of our estimators is too computationally intensive to be implemented on a computer, as we shall explain below. The same conclusions essentially apply to all other papers dealing with the subject. The approach of Grégoire and Nembé [21], which extends previous results of Barron and Cover about density estimation to that of intensities, has some similarities with ours. The paper by Kolaczyk and Nowak [25] based on penalized maximum likelihood focuses on Poisson regression. Methods which can also be viewed as cases of model selection are those based on the
thresholding of the empirical coefficients with respect to some orthonormal basis. It 
is known that such a procedure is akin to model selection with models spanned by 
finite subsets of a basis. They have been considered in Kolaczyk [2J|, Antoniadis, 
Besbeas and Sapatinas [l[, Antoniadis and Sapatinas Q and Patil and Wood [jjj ]. 



1.6. An overview of the paper 

We already justified the introduction of our Hellinger type loss functions by the properties of likelihood ratio tests and we shall explain, in the next section, why the more popular $L_2$-risk is not suitable for our purposes, at least if we want to deal with possibly unbounded intensities. To show this, we shall design a general tool for getting lower bounds for intensity estimation, which is merely a version of Assouad's Lemma for Poisson processes. We shall also show that recent results by Rigollet and Tsybakov [33] on aggregation of estimators for density estimation extend straightforwardly to the Poisson case. In Section 3 we briefly recall the general construction of T-estimators introduced in Birgé [9] and apply it to the specific case of Poisson processes. We also provide an illustration based on non-linear approximating models. Section 4 is devoted to various applications of our method based on families of linear models. This section essentially relies on results from approximation theory about the approximation of different classes of functions (typically smoothness classes) by finite dimensional linear spaces in $L_2$. We also indicate how to mix different families of models and introduce an asymptotic point of view which allows us to consider convergence rates and to make a parallel with density estimation. In Section 5 we deal with aggregation of estimators with some applications to partition selection for histograms. The final section is devoted to the proof of the most important technical result in this paper, namely the existence and properties of tests between balls of mean measures. This is the key argument which is required to apply the construction of T-estimators to the problem of estimating the mean measure of a Poisson process. It also has other applications, in particular to the study of Bayesian procedures as done, for instance, in Ghosal, Ghosh and van der Vaart [20] and subsequent work of van der Vaart and coauthors.



2. Estimation with L2-loss

2.1. From density to intensity estimation 

A classical approach to density estimation is based on $L_2$-loss. We assume that the observations $X_1, \dots, X_n$ have a density $s_1$ with respect to some dominating measure $\lambda$ and that $s_1$ belongs to the Hilbert space $L_2(\lambda)$ with scalar product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|_2$. Given an estimator $\hat s(X_1, \dots, X_n)$ we define its risk by $\mathbb{E}[\|\hat s - s_1\|_2^2]$. In this theory, a central role is played by projection estimators as defined by Cencov. Model selection based on projection estimators has been considered by Birgé and Massart [11]. A more modern treatment can be found in Massart [27]. Thresholding estimators based on wavelet expansions as described in Cohen, DeVore, Kerkyacharian and Picard [15] (see also the many further references therein) can also be viewed as special cases of those. Recently Rigollet and Tsybakov [32] introduced an aggregation method based on projection estimators. Projection estimators have the advantage of simplicity and the drawback of requiring somewhat restrictive assumptions on the density $s_1$ to be estimated, not only that it belongs to $L_2$ but most of the time to $L_\infty$. As shown in Birgé [10], Section 5.4.1, the fact that $s_1$ belongs to $L_\infty$ is essentially a necessary condition to have a control on the $L_2$-risk of estimators of $s_1$.

As indicated in Baraud and Birgé [1] Section 4.2, there is a parallel between the estimation of a density $s_1$ from $n$ i.i.d. observations and the estimation of the intensity $s = ns_1$ from a Poisson process. This suggests adapting the known results from density estimation to intensity estimation for Poisson processes. We shall briefly explain how it works, when the Poisson process $X$ has an intensity $s \in L_\infty(\lambda)$ with $L_\infty$-norm $\|s\|_\infty$.

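The starting point is to observe that, given an element $\varphi \in L_2(\lambda)$, a natural estimator of $\langle\varphi, s\rangle$ is $\hat\varphi(X) = \int \varphi\,d\Lambda_X = \sum_{i=1}^N \varphi(X_i)$. It follows from (1.1) that

$$\mathbb{E}_s[\hat\varphi(X)] = \langle\varphi, s\rangle \quad\text{and}\quad \operatorname{Var}_s(\hat\varphi(X)) = \int \varphi^2 s\,d\lambda \le \|s\|_\infty\|\varphi\|_2^2. \tag{2.1}$$

Given a $D$-dimensional linear subspace $S'$ of $L_2(\lambda)$ with an orthonormal basis $\varphi_1, \dots, \varphi_D$, we can estimate $s$ by the projection estimator with respect to $S'$:

$$\hat s(X) = \sum_{j=1}^D \hat\varphi_j(X)\,\varphi_j = \sum_{j=1}^D \Big[\sum_{i=1}^N \varphi_j(X_i)\Big]\varphi_j.$$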



38 



L. Birge 



It follows from (2.1) that its risk is bounded by

$$\mathbb{E}_s\big[\|\hat s(X) - s\|_2^2\big] \le \inf_{t\in S'}\|t - s\|_2^2 + \|s\|_\infty D. \tag{2.2}$$

Note that $\hat s(X)$ is not necessarily an intensity since it may take negative values. This can be fixed: replacing $\hat s(X)$ by its positive part can only reduce the risk since $s$ is nonnegative.
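To make the projection estimator concrete, the following sketch computes the empirical coefficients $\sum_i \varphi_j(X_i)$ and the resulting estimate for one particular choice of orthonormal system. The choice of $\mathcal{X} = [0,1]$ with the Lebesgue measure and of the cosine basis is an assumption made for illustration only, and the function names are hypothetical.

```python
import math

def cosine_basis(j, x):
    """Orthonormal cosine system of L2([0, 1], dx): phi_0 = 1, phi_j = sqrt(2) cos(pi j x)."""
    return 1.0 if j == 0 else math.sqrt(2.0) * math.cos(math.pi * j * x)

def projection_coefficients(points, dim):
    """Empirical coefficients sum_i phi_j(X_i) for j = 0, ..., dim - 1."""
    return [sum(cosine_basis(j, x) for x in points) for j in range(dim)]

def projection_estimator(points, dim):
    """Return the function x -> sum_j coef_j * phi_j(x); it may take negative values."""
    coefs = projection_coefficients(points, dim)
    return lambda x: sum(c * cosine_basis(j, x) for j, c in enumerate(coefs))

def positive_part(f):
    """Truncate an estimate at zero, as suggested in the text."""
    return lambda x: max(f(x), 0.0)

s_hat = positive_part(projection_estimator([0.25, 0.5], dim=2))
```

The coefficient of index 0 is simply the number of observed points, so the estimate always integrates to $N$ before truncation.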



2.2. Aggregation of preliminary estimators



The purpose of this section is to extend some recent results for aggregation of density estimators due to Rigollet and Tsybakov [32] to intensity estimation. The basic tool for aggregation in the context of Poisson processes is the procedure of "thinning" which is the equivalent of sample splitting for i.i.d. observations, see for instance Reiss [29], page 68. Assume that we have at our disposal a Poisson process with mean measure $\mu$: $\Lambda_X = \sum_{i=1}^N \delta_{X_i}$ and an independent sequence $(Y_i)_{i\ge 1}$ of i.i.d. Bernoulli variables with parameter $p \in (0,1)$. Then the two random measures $\Lambda_{X_1} = \sum_{i=1}^N Y_i\delta_{X_i}$ and $\Lambda_{X_2} = \sum_{i=1}^N (1 - Y_i)\delta_{X_i}$ are two independent Poisson processes with respective mean measures $p\mu$ and $(1-p)\mu$.
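The thinning step is simple enough to write down directly. The sketch below splits the observed points with independent Bernoulli($p$) marks; the function name and the use of a seeded `random.Random` instance are illustrative choices.

```python
import random

def thin(points, p, rng):
    """Split `points` into two lists using independent Bernoulli(p) marks.

    Points with mark 1 go to the first list (mean measure p * mu),
    the others to the second list (mean measure (1 - p) * mu).
    """
    kept, rest = [], []
    for x in points:
        (kept if rng.random() < p else rest).append(x)
    return kept, rest

rng = random.Random(1)
observed = [i / 10 for i in range(10)]
first, second = thin(observed, 0.5, rng)
```

Every observed point ends up in exactly one of the two lists, so no information is lost by the split.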

Now assume that $X$ is a Poisson process with intensity $s$ with respect to $\lambda$, that $X_1$ and $X_2$ have been derived from $X$ by thinning and that we have at our disposal a finite family $\{\hat s_m(X_1), m \in \mathcal{M}\}$ of estimators of $ps$ based on the first process and belonging to $L_2(\lambda)$. They may be projection estimators or others. These estimators span a $D$-dimensional linear subspace of $L_2(\lambda)$ with an orthonormal basis $\varphi_1, \dots, \varphi_D$, $D \le |\mathcal{M}|$. Working conditionally with respect to $X_1$, we use $X_2$ to build a projection estimator $\tilde s(X_2)$ of $(1-p)s$ belonging to the linear span of the estimators $\hat s_m(X_1)$. This is exactly the method used by Rigollet and Tsybakov [32] for density estimation and the proof of their Theorem 2.1 extends straightforwardly to Poisson processes to give

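Theorem 2. The aggregated estimator $\tilde s$ based on the processes $X_1$ and $X_2$ obtained by thinning of $X$ satisfies

$$\mathbb{E}_s\big[\|\tilde s(X) - (1-p)s\|_2^2\big] \le \Big(\frac{1-p}{p}\Big)^2 \inf_{m\in\mathcal{M}} \mathbb{E}_s\big[\|ps - \hat s_m(X_1)\|_2^2\big] + (1-p)\|s\|_\infty|\mathcal{M}|. \tag{2.3}$$

Setting $\hat s(X) = \tilde s(X)/(1-p)$ leads to

$$\mathbb{E}_s\big[\|\hat s(X) - s\|_2^2\big] \le \frac{1}{p^2}\inf_{m\in\mathcal{M}} \mathbb{E}_s\big[\|ps - \hat s_m(X_1)\|_2^2\big] + \frac{\|s\|_\infty|\mathcal{M}|}{1-p}.$$

If we start with a finite family $\{S_m, m \in \mathcal{M}\}$ of finite-dimensional linear subspaces of $L_2(\lambda)$ with respective dimensions $D_m$, we may choose for $\hat s_m(X_1)$ the projection estimator based on $S_m$ with risk bounded by (2.2):

$$\mathbb{E}_s\big[\|\hat s_m(X_1) - ps\|_2^2\big] \le \inf_{t\in S_m}\|t - ps\|_2^2 + p\|s\|_\infty D_m = p^2\inf_{t\in S_m}\|t - s\|_2^2 + p\|s\|_\infty D_m.$$

Choosing $p = 1/2$, we conclude that

$$\mathbb{E}_s\big[\|\hat s(X) - s\|_2^2\big] \le \inf_{m\in\mathcal{M}}\Big\{\inf_{t\in S_m}\|t - s\|_2^2 + 2\|s\|_\infty D_m\Big\} + 2\|s\|_\infty|\mathcal{M}|.$$

2.3. Lower bounds for intensity estimation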

It is rather inconvenient to get risk bounds involving the unknown and possibly very large $L_\infty$-norm of $s$ and this problem becomes even more serious if $s$ does not belong to $L_\infty(\lambda)$. It is, unfortunately, impossible to avoid this problem when dealing with the $L_2$-loss. To show this, let us start with a version of Assouad's Lemma for Poisson processes.

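Lemma 1. Let $S_{\mathcal{D}} = \{s_\delta, \delta \in \mathcal{D}\} \subset L_1^+(\lambda)$ be a family of intensities indexed by $\mathcal{D} = \{0; 1\}^D$ and $\Delta$ be the Hamming distance on $\mathcal{D}$ given by $\Delta(\delta, \delta') = \sum_{j=1}^D |\delta_j - \delta'_j|$. Let $\mathcal{C}$ be the subset of $\mathcal{D}\times\mathcal{D}$ defined by

$$\mathcal{C} = \{(\delta, \delta') \mid \exists k,\ 1 \le k \le D \text{ with } \delta_k = 0,\ \delta'_k = 1 \text{ and } \delta_j = \delta'_j \text{ for } j \ne k\}.$$

Then for any estimator $\hat\delta(X)$ with values in $\mathcal{D}$,

$$\sup_{\delta\in\mathcal{D}} \mathbb{E}_{s_\delta}\big[\Delta(\hat\delta(X), \delta)\big] \ge \frac{D}{4|\mathcal{C}|}\sum_{(\delta,\delta')\in\mathcal{C}} \exp\big[-2H^2(s_\delta, s_{\delta'})\big]. \tag{2.4}$$

If, moreover, $S_{\mathcal{D}} \subset L \subset L_1^+(\lambda)$ and $L$ is endowed with a metric $d$ satisfying $d^2(s_\delta, s_{\delta'}) \ge \theta\Delta(\delta, \delta')$ for all $\delta, \delta' \in \mathcal{D}$ and some $\theta > 0$, then for any estimator $\hat s(X)$ with values in $L$,

$$\sup_{s\in S_{\mathcal{D}}} \mathbb{E}_s\big[d^2(\hat s(X), s)\big] \ge \frac{D\theta}{16|\mathcal{C}|}\sum_{(\delta,\delta')\in\mathcal{C}} \exp\big[-2H^2(s_\delta, s_{\delta'})\big]. \tag{2.5}$$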



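Proof. To get (2.4) it suffices to find a lower bound for

$$R_B = 2^{-D}\sum_{\delta\in\mathcal{D}} \mathbb{E}_{s_\delta}\big[\Delta(\hat\delta(X), \delta)\big] = 2^{-D}\sum_{\delta\in\mathcal{D}} \int \sum_{k=1}^D \big|\hat\delta_k(X) - \delta_k\big|\,dQ_{s_\delta},$$

since the left-hand side of (2.4) is at least as large as the average risk $R_B$. It follows from the proof of Lemma 2 in Birgé [10] with $n = 1$ that

$$R_B \ge 2^{-D}\sum_{(\delta,\delta')\in\mathcal{C}} \Big[1 - \sqrt{1 - \rho^2(Q_{s_\delta}, Q_{s_{\delta'}})}\Big] \ge 2^{-D-1}\sum_{(\delta,\delta')\in\mathcal{C}} \rho^2(Q_{s_\delta}, Q_{s_{\delta'}}).$$

Then (2.4) follows from (1.6) since $|\mathcal{C}| = D2^{D-1}$. Let now $\hat s(X)$ be an estimator with values in $L$ and set $\hat\delta(X) \in \mathcal{D}$ to satisfy $d(\hat s, s_{\hat\delta}) = \inf_{\delta\in\mathcal{D}} d(\hat s, s_\delta)$ so that, whatever $\delta \in \mathcal{D}$, $d(s_{\hat\delta}, s_\delta) \le 2d(\hat s, s_\delta)$. It then follows from our assumptions that

$$\sup_{\delta\in\mathcal{D}} \mathbb{E}_{s_\delta}\big[d^2(\hat s, s_\delta)\big] \ge \frac{1}{4}\sup_{\delta\in\mathcal{D}} \mathbb{E}_{s_\delta}\big[d^2(s_{\hat\delta}, s_\delta)\big] \ge \frac{\theta}{4}\sup_{\delta\in\mathcal{D}} \mathbb{E}_{s_\delta}\big[\Delta(\hat\delta, \delta)\big]$$

and (2.5) follows from (2.4). □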



The simplest application of this lemma corresponds to the case $D = 1$ which, in its simplest form, dates back to Le Cam [26]. We consider only two intensities $s_0$ and $s_1$ so that $\theta = d^2(s_0, s_1)$ and (2.5) gives, whatever the estimator $\hat s(X)$,

$$\max_{i=0,1} \mathbb{E}_{s_i}\big[d^2(\hat s(X), s_i)\big] \ge \frac{d^2(s_0, s_1)}{16}\exp\big[-2H^2(s_0, s_1)\big]. \tag{2.6}$$

Another typical application of the previous lemma to intensities on $[0,1]$ uses the following construction of a suitable set $S_{\mathcal{D}}$.

Lemma 2. Let $D$ be a positive integer and $g$ be a function on $\mathbb{R}$ with support on $[0, D^{-1})$ satisfying

$$0 \le g(x) \le 1 \text{ for all } x \quad\text{and}\quad \int_0^{D^{-1}} g^2(x)\,dx = a > 0.$$

Set, for $1 \le j \le D$ and $0 \le x \le 1$, $g_j(x) = g(x - D^{-1}(j-1))$ and, for $\delta \in \mathcal{D}$, $s_\delta(x) = a^{-1}\big[1 + \sum_{j=1}^D (\delta_j - 1/2)g_j(x)\big]$. Then $\|s_\delta - s_{\delta'}\|_2^2 = a^{-1}\Delta(\delta, \delta')$ and $H^2(s_\delta, s_{\delta'}) \ge \Delta(\delta, \delta')/8$ for all $\delta, \delta' \in \mathcal{D}$. Moreover,

$$|\mathcal{C}|^{-1}\sum_{(\delta,\delta')\in\mathcal{C}} \exp\big[-2H^2(s_\delta, s_{\delta'})\big] \ge \exp[-2/7]. \tag{2.7}$$

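Proof. The first equality is clear. Let us then observe that our assumptions on $g$ imply that $1 - g^2(x)/7 \le \sqrt{1 - g^2(x)/4} \le 1 - g^2(x)/8$, hence, since the functions $g_j$ have disjoint supports and are translates of $g$,

$$H^2(s_\delta, s_{\delta'}) = (2a)^{-1}\sum_{j=1}^D |\delta_j - \delta'_j| \int_0^{D^{-1}} \Big(\sqrt{1 + g(x)/2} - \sqrt{1 - g(x)/2}\Big)^2 dx$$

$$\phantom{H^2(s_\delta, s_{\delta'})} = a^{-1}\sum_{j=1}^D |\delta_j - \delta'_j| \int_0^{D^{-1}} \Big[1 - \sqrt{1 - g^2(x)/4}\Big] dx = c\,\Delta(\delta, \delta'),$$

with $1/8 \le c \le 1/7$. The conclusions follow. □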

Corollary 1. For each positive integer $D$ and $L \ge 3D/2$, one can find a finite set $S_{\mathcal{D}}$ of intensities with the following properties:

(i) it is a subset of some $D$-dimensional affine subspace of $L_2([0,1], dx)$;

(ii) $\sup_{s\in S_{\mathcal{D}}} \|s\|_\infty \le L$;

(iii) for any estimator $\hat s(X)$ with values in $L_2([0,1], dx)$ based on a Poisson process $X$ with intensity $s$,

$$\sup_{s\in S_{\mathcal{D}}} \mathbb{E}_s\big[\|\hat s - s\|_2^2\big] \ge (DL/24)\exp[-2/7]. \tag{2.8}$$

Proof. Let us set $\theta = 2L/3 \ge D$ and apply the construction of Lemma 2 with $g(x) = \sqrt{D/\theta}\,1_{[0,1/D)}(x)$, hence $a = \theta^{-1}$. This results in the set $S_{\mathcal{D}}$ with $\|s_\delta\|_\infty \le \theta\big[1 + (1/2)\sqrt{D/\theta}\big] \le 3\theta/2 = L$ for all $\delta \in \mathcal{D}$ as required. Moreover $\|s_\delta - s_{\delta'}\|_2^2 = \theta\Delta(\delta, \delta')$. Then we use Lemma 1 with $d$ being the distance corresponding to the norm in $L_2([0,1], dx)$ and (2.5) together with (2.7) result in (2.8). □
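The constants in Lemma 2 and Corollary 1 can be checked numerically for the indicator-type function $g = \sqrt{D/\theta}\,1_{[0,1/D)}$ used in the proof. The sketch below computes $H^2(s_\delta, s_{\delta'})$ for two indices that differ in a single coordinate, where the integrand is constant on the relevant cell, and evaluates the right-hand side of (2.8). Function names are hypothetical and the formulas are transcribed from the statements above.

```python
import math

def neighbour_hellinger_sq(D, theta):
    """H^2(s_delta, s_delta') when delta and delta' differ in exactly one coordinate.

    Uses g = sqrt(D / theta) on [0, 1/D), so that a = 1 / theta (requires theta >= D).
    """
    g = math.sqrt(D / theta)
    a = 1.0 / theta                      # a = integral of g^2
    s_plus = (1.0 + g / 2.0) / a         # intensity on the cell when delta_j = 1
    s_minus = (1.0 - g / 2.0) / a        # intensity on the cell when delta_j = 0
    # (1/2) * integral over a cell of length 1/D of (sqrt(s_plus) - sqrt(s_minus))^2
    return 0.5 * (math.sqrt(s_plus) - math.sqrt(s_minus)) ** 2 / D

def corollary_lower_bound(D, L):
    """Right-hand side (D L / 24) exp(-2/7) of the lower bound (2.8)."""
    return D * L / 24.0 * math.exp(-2.0 / 7.0)

# The constant c of the proof of Lemma 2 for a few values of D and theta >= D.
constants = [neighbour_hellinger_sq(D, mult * D) for D in (1, 3, 10) for mult in (1, 2, 50)]
```

All values in `constants` lie between 1/8 and 1/7, as claimed in the proof of Lemma 2, and the lower bound grows linearly in $L$ for fixed $D$.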

This result implies that, if we want to use the squared $L_2$-norm as a loss function, whatever the choice of our estimator there is no hope to find risk bounds that are independent of the $L_\infty$-norm of the underlying intensity, even if this intensity belongs to a finite-dimensional affine space. This provides an additional motivation for the introduction of loss functions based on the distance $H$.



3. T-estimators for Poisson processes 
3.1. Some notations 

Throughout this paper, we observe a Poisson process $X$ on $\mathcal{X}$ with unknown mean measure $\mu$ belonging to the metric space $(\mathcal{Q}_+(\mathcal{X}), H)$ and have at hand some reference measure $\lambda$ on $\mathcal{X}$ so that $\mu = \mu_s + \mu^\perp$ with $\mu_s \in \mathcal{Q}_\lambda$, $s \in L_1^+(\lambda)$ and $\mu^\perp$ orthogonal to $\lambda$. We denote by $\|\cdot\|_i$ the norm in $L_i(\lambda)$ for $1 \le i \le \infty$ and by $d_2$ the distance corresponding to the norm $\|\cdot\|_2$. We always denote by $s$ the intensity of the part of $\mu$ which is dominated by $\lambda$ and set $s_1 = s/\mu_s(\mathcal{X})$. We also systematically identify $\mathcal{Q}_\lambda$ with $L_1^+(\lambda)$ via the mapping $t \mapsto \mu_t$, writing $t$ as a shorthand for $\mu_t \in \mathcal{Q}_\lambda$. We write $H(s, S')$ for $\inf_{t\in S'} H(s,t)$, $a \vee b$ and $a \wedge b$ for the maximum and the minimum respectively of $a$ and $b$, $|A|$ for the cardinality of a finite set $A$ and $\mathbb{N}^* = \mathbb{N}\setminus\{0\}$ for the set of positive integers. In the sequel $C$ (or $C'$, $C_1$, ...) denote constants that may vary from line to line, the form $C(a,b)$ meaning that $C$ is not a universal constant but depends on some parameters $a$ and $b$.

3.2. Definition and properties of T-estimators 

In order to explain our method of estimation and model selection, we need to recall some general results from Birgé [9] about T-estimators that we shall specialize to the specific framework of this paper. Let $(M, d)$ be some metric space and $B(t, r)$ denote the open ball of center $t$ and radius $r$ in $M$.

Definition 1. A subset $S'$ of the metric space $(M, d)$ is called a D-model with parameters $\eta$, $D$ and $B'$ ($\eta, B', D > 0$) if

$$|S' \cap B(t, x\eta)| \le B'\exp\big[Dx^2\big] \quad\text{for all } x \ge 2 \text{ and } t \in M. \tag{3.1}$$

Note that this implies that $S'$ is at most countable.

To estimate the unknown mean measure $\mu$ of the Poisson process $X$, we introduce a finite or countable family $\{S_m, m \in \mathcal{M}\}$ of D-models in $(\mathcal{Q}_\lambda, H)$ with respective parameters $\eta_m$, $D_m$ and $B'$ and assume that

$$\text{for all } m \in \mathcal{M}, \quad D_m \ge 1/2 \quad\text{and}\quad \eta_m^2 \ge (84D_m)/5, \tag{3.2}$$

and

$$\sum_{m\in\mathcal{M}} \exp\big[-\eta_m^2/84\big] = \Sigma < +\infty. \tag{3.3}$$

Then we set $S = \bigcup_{m\in\mathcal{M}} S_m$ and, for each $t \in S$,

$$\eta(t) = \inf\{\eta_m \mid m \in \mathcal{M} \text{ and } S_m \ni t\}. \tag{3.4}$$

Remark. Note that if we choose for $\{S_m, m \in \mathcal{M}\}$ a family of D-models in $(\mathcal{Q}_+(\mathcal{X}), H)$, $S$ is countable and therefore dominated by some measure $\lambda$ that we can always take as our reference measure. This gives an a posteriori justification for the choice of a family of models $S_m \subset \mathcal{Q}_\lambda$.

Given two distinct points $t, u \in \mathcal{Q}_\lambda$ we define a test function $\psi(X)$ between $t$ and $u$ as a measurable function of $X$ with values in $\{t, u\}$, $\psi(X) = t$ meaning deciding $t$ and $\psi(X) = u$ meaning deciding $u$. In order to define a T-estimator, we need a family of test functions $\psi_{t,u}(X)$ between distinct points $t, u \in S$ with some special properties. The following proposition, to be proved in the final section, warrants their existence.

Proposition 1. Given two distinct points $t, u \in S$ there exists a test $\psi_{t,u}$ between $t$ and $u$ which satisfies

$$\sup_{\{\mu\in\mathcal{Q}_+(\mathcal{X}) \,\mid\, H(\mu,\mu_t) \le H(t,u)/4\}} \mathbb{P}_\mu[\psi_{t,u}(X) = u] \le \exp\big[-\big(H^2(t,u) - \eta^2(t) + \eta^2(u)\big)/4\big],$$

$$\sup_{\{\mu\in\mathcal{Q}_+(\mathcal{X}) \,\mid\, H(\mu,\mu_u) \le H(t,u)/4\}} \mathbb{P}_\mu[\psi_{t,u}(X) = t] \le \exp\big[-\big(H^2(t,u) - \eta^2(u) + \eta^2(t)\big)/4\big],$$

and for all $\mu \in \mathcal{Q}_+(\mathcal{X})$,

$$\mathbb{P}_\mu[\psi_{t,u}(X) = u] \le \exp\big[\big(16H^2(\mu,\mu_t) + \eta^2(t) - \eta^2(u)\big)/4\big]. \tag{3.5}$$

To build a T-estimator, we proceed as follows. We consider a family of tests $\psi_{t,u}$ indexed by the two-points subsets $\{t, u\}$ of $S$ with $t \ne u$ that satisfy the conclusions of Proposition 1 and we set $\mathcal{R}_t = \{u \in S, u \ne t \mid \psi_{t,u}(X) = u\}$ for each $t \in S$. Then we define the random function $\mathcal{D}_X$ on $S$ by

$$\mathcal{D}_X(t) = \begin{cases} \sup_{u\in\mathcal{R}_t}\{H(t,u)\} & \text{if } \mathcal{R}_t \ne \emptyset; \\ 0 & \text{if } \mathcal{R}_t = \emptyset. \end{cases}$$

We call T-estimator derived from $S$ and the family of tests $\psi_{t,u}(X)$ any measurable minimizer of the function $t \mapsto \mathcal{D}_X(t)$ from $S$ to $[0, +\infty]$ so that $\mathcal{D}_X(\hat s(X)) = \inf_{t\in S} \mathcal{D}_X(t)$. Such a minimizer need not exist in general but it actually exists under our assumptions.
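For a finite candidate set, the selection rule just described can be written as a short loop. The sketch below is a generic illustration: `dist` plays the role of $H$ and `decide(t, u)` stands for the outcome of the test between $t$ and $u$. The toy test used in the example (pick the candidate closer to an observed number) is purely hypothetical and is not the test of Proposition 1.

```python
def t_estimator(candidates, dist, decide):
    """Minimise D_X(t) = sup{dist(t, u) : u != t and the test between t and u decides u}.

    decide(t, u) must return either t or u and give the same decision as decide(u, t)
    (one test per unordered pair). Returns a minimiser and the minimal value.
    """
    best, best_value = None, float("inf")
    for t in candidates:
        rejected_by = [u for u in candidates if u != t and decide(t, u) == u]
        value = max((dist(t, u) for u in rejected_by), default=0.0)
        if value < best_value:
            best, best_value = t, value
    return best, best_value

# Toy illustration on the real line (hypothetical test, for illustration only).
observation = 1.2
candidates = [0.0, 1.0, 2.0, 5.0]
dist = lambda t, u: abs(t - u)
decide = lambda t, u: t if abs(t - observation) <= abs(u - observation) else u
selected, value = t_estimator(candidates, dist, decide)
```

The quadratic number of pairwise tests in this loop is one reason why the construction is mainly of theoretical interest when the candidate set is large.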

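Theorem 3. Let $S = \bigcup_{m\in\mathcal{M}} S_m \subset \mathcal{Q}_\lambda$ be a finite or countable family of D-models in $(\mathcal{Q}_\lambda, H)$ with respective parameters $\eta_m$, $D_m$ and $B'$ satisfying (3.2) and (3.3). Let $\{\psi_{t,u}\}$ be a family of tests indexed by the two-points subsets $\{t, u\}$ of $S$ with $t \ne u$ and satisfying the conclusions of Proposition 1. Whatever $\mu \in \mathcal{Q}_+(\mathcal{X})$, $\mathbb{P}_\mu$-a.s. there exists at least one T-estimator $\hat s = \hat s(X) \in S$ derived from this family of tests and any of them satisfies, for all $s' \in S$,

$$\mathbb{P}_\mu[H(s', \hat s) > y] \le (B'\Sigma/7)\exp\big[-y^2/6\big] \quad\text{for } y \ge 4[H(\mu, \mu_{s'}) \vee \eta(s')]. \tag{3.6}$$

Setting $\hat\mu(X) = \hat s(X)\cdot\lambda$ and $\mu = \mu_s + \mu^\perp$ with $\mu_s \in \mathcal{Q}_\lambda$ and $\mu^\perp$ orthogonal to $\lambda$, we also get

$$\mathbb{E}_\mu\big[H^q(\mu, \hat\mu(X))\big] \le C(q)[1 + B'\Sigma] \inf_{m\in\mathcal{M}}\Big\{H(s, S_m) + \eta_m + \sqrt{\mu^\perp(\mathcal{X})}\Big\}^q \tag{3.7}$$

and, for intensity estimation when $\mu = \mu_s$,

$$\mathbb{E}_s\big[H^q(s, \hat s(X))\big] \le C(q)[1 + B'\Sigma] \inf_{m\in\mathcal{M}}\{H(s, S_m) + \eta_m\}^q. \tag{3.8}$$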

Proof. It follows from Theorem 5 in Birgé [9] with $a = 1/4$, $B = 1$, $\kappa = 4$ and $\kappa' = 16$ that T-estimators do exist, satisfy (3.6) and have a risk which is bounded, for $q \ge 1$, by

$$\mathbb{E}_\mu\big[H^q(\mu, \hat\mu(X))\big] \le C(q)[1 + B'\Sigma] \inf_{m\in\mathcal{M}}\Big\{\Big(\inf_{t\in S_m} H(\mu, \mu_t)\Big) \vee \eta_m\Big\}^q. \tag{3.9}$$

In Birgé [9], the proof of the existence of T-estimators when $\mathcal{M}$ is infinite was given only for the case that the tests $\psi_{t,u}(X)$ have a special form, namely $\psi_{t,u}(X) = u$ when $\gamma(u, X) < \gamma(t, X)$ and $\psi_{t,u}(X) = t$ when $\gamma(u, X) > \gamma(t, X)$ for some suitable function $\gamma$. A minor modification of the proof extends the result to the general situation based on the assumption that (3.5) holds. It is indeed enough to use (3.5) to modify the proof of (7.18) of Birgé [9] in order to get instead

$$\mathbb{P}_\mu\big[\exists t \in S \text{ with } \psi_{s',t}(X) = t \text{ and } \eta(t) \ge y\big] \longrightarrow 0 \quad\text{as } y \to +\infty.$$

The existence of $\hat s(X)$ then follows straightforwardly. Since $H^2(\mu, \mu_t) = H^2(s, t) + \mu^\perp(\mathcal{X})/2$, (3.7) follows from (3.9). □

It follows from (3.7) that the problem of estimating $\mu$ with T-estimators always reduces to intensity estimation once a reference measure $\lambda$ has been chosen. A comparison of the risk bounds (3.7) and (3.8) shows that the performance of the estimator $\hat s(X)$ is connected to the choice of the models in $L_1^+(\lambda)$, the component $\mu^\perp(\mathcal{X})$ of the risk depending only on $\lambda$. We might as well assume that $\mu^\perp(\mathcal{X})$ is known since this would not change anything concerning the performance of the T-estimators for a given $\lambda$. This is why we shall essentially focus, in the sequel, on intensity estimation.



3.3. An application to multivariate intensities 

Let us first illustrate Theorem 3 by an application to the estimation of the unknown intensity $s$ (with respect to the Lebesgue measure $\lambda$) of a Poisson process on $\mathcal{X} = [-1,1]^k$. For this, we introduce a family of non-linear models related to neural nets which were popularized in the 90's by Barron and other authors in view of their nice approximation properties with respect to functions of several variables. These models have already been studied in detail in Sections 3.2.2 and 4.2.2 of Barron, Birgé and Massart [7] and we shall therefore refer to this paper for their properties. We start with a family of functions $\phi_w(x) \in L_\infty([-1,1]^k)$ indexed by a parameter $w$ belonging to $\mathbb{R}^{k'}$ and satisfying

$$|\phi_w(x) - \phi_{w'}(x)| \le |w - w'|_1 \quad\text{for all } x \in [-1,1]^k, \tag{3.10}$$

where $|\cdot|_1$ denotes the $\ell_1$-norm on $\mathbb{R}^{k'}$. Various examples of such families are given in Barron, Birgé and Massart [7] and one can, for instance, set $\phi_w(x) = \psi(a'x - b)$ with $\psi$ a univariate Lipschitz function, $a \in \mathbb{R}^k$, $b \in \mathbb{R}$ and $w = (a, b) \in \mathbb{R}^{k+1}$.

We set $\mathcal{M} = (\mathbb{N}\setminus\{0,1\})^3$ and for $m = (J, R, B) \in \mathcal{M}$ we consider the subset of $L_\infty([-1,1]^k)$ defined by

$$S'_m = \Big\{\sum_{j=1}^J \beta_j\phi_{w_j} \ \Big|\ \sum_{j=1}^J |\beta_j| \le R \text{ and } |w_j|_1 \le B \text{ for } 1 \le j \le J\Big\}.$$

As shown in Lemma 5 of Barron, Birgé and Massart [7], such a model can be approximated by a finite subset $T_m$. More precisely, one can find a subset $T_m$ of $S'_m$ with cardinality bounded by $[2e(2RB + 1)]^{J(k'+1)}$ and such that if $u \in S'_m$, there exists some $t \in T_m$ such that $\|t - u\|_\infty \le 1$. Defining $S_m$ as $\{t^2, t \in T_m\}$, we get the following property:

Lemma 3. For rn = {J, R, B) £ (N \ {0, l}) 3 , we set r}^ = 42J(fc' + 1) \og(RB). 
Then S m is a D-model with parameters i] m ,D m = [J{k' + l)/4] log[2e(2i?_B + 1)] 
and 1 in the metric space (L^ (A), if) and and \3. 3\) are satisfied. Moreover, 

for any s £ L^(A), 

(3.11) V2H(s,S m ) < w£ \\^-t\\ 2 + 2 k/2 . 



li 
Proof. Since $|S_m| \le |T_m|$, to show that $S_m$ is a D-model with the given parameters
it is enough to prove, in view of (3.1), that $|T_m| \le \exp[4D_m]$, which is clear. That
$\eta_m^2/84 \ge D_m/5$ follows from $\log[2e(2RB+1)] \le 4\log(RB)$ since $RB \ge 4$. Moreover,
since $k'+1 \ge 2$, $\eta_m^2 \ge 84 J\log(RB)$, hence

$\sum_{m \in \mathcal{M}} \exp[-\eta_m^2/84] \le \sum_{J \ge 2} \Big( \sum_{n \ge 2} n^{-J} \Big)^2 < +\infty,$

so that (3.3) holds. Let now $u \in S'_m$. There exists $t \in T_m$ such that $\|t - u\|_\infty \le 1$,
hence $\|\sqrt{s} - t\|_2 \le \|\sqrt{s} - u\|_2 + 2^{k/2}$. Then $t^2 \in S_m$ and since $\|\sqrt{s} - \sqrt{t^2}\|_2 \le \|\sqrt{s} - t\|_2$, (3.11) follows. □
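
The parameters of Lemma 3 are explicit, so the two inequalities used in the proof can be checked numerically. The following is only an illustrative sketch: the function name `lemma3_parameters` and the sample values are assumptions made for the illustration and do not come from the paper.

```python
import math

def lemma3_parameters(J: int, R: int, B: int, k_prime: int):
    """Return (eta_sq, D_m, log_card_bound) for the model m = (J, R, B).

    log_card_bound is the logarithm of the cardinality bound
    [2e(2RB+1)]^{J(k'+1)} on T_m quoted before Lemma 3.
    """
    if min(J, R, B) < 2:
        raise ValueError("J, R and B must all be at least 2")
    dim = J * (k_prime + 1)
    log_card_bound = dim * math.log(2 * math.e * (2 * R * B + 1))
    D_m = log_card_bound / 4            # D_m = [J(k'+1)/4] log[2e(2RB+1)]
    eta_sq = 42 * dim * math.log(R * B)  # eta_m^2 = 42 J(k'+1) log(RB)
    return eta_sq, D_m, log_card_bound

# Hypothetical example values, chosen only for illustration.
eta_sq, D_m, log_card = lemma3_parameters(J=3, R=2, B=2, k_prime=4)
print(eta_sq / 84 >= D_m / 5)  # the inequality eta_m^2/84 >= D_m/5 from the proof
```

With these definitions $\log|T_m| \le 4D_m$ holds by construction, which is the first claim of the proof.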

Let now $\hat{s}(X)$ be a T-estimator derived from the family of D-models $\{S_m, m \in \mathcal{M}\}$.
By Theorem 3 and Lemma 3 it satisfies

$\mathbb{E}_s\big[H^2(s,\hat{s}(X))\big] \le C \inf_{m \in \mathcal{M}} \Big\{ \inf_{t \in S'_m} \|\sqrt{s} - t\|_2^2 + 2^k + \eta_m^2 \Big\}$

(3.12) $\le C(k,k') \inf_{m \in \mathcal{M}} \Big\{ \inf_{t \in S'_m} \|\sqrt{s} - t\|_2^2 + J\log(RB) \Big\}.$

The approximation properties of the models $S'_m$ with respect to different classes
of functions have been described in Barron, Birgé and Massart [7]. They allow one to
bound $\inf_{t \in S'_m} \|\sqrt{s} - t\|_2$ when $\sqrt{s}$ belongs to such classes so that corresponding
risk bounds can be derived from (3.12).



3.4. Model selection based on linear models

3.4.1. Deriving D-models from linear spaces

In order to apply Theorem 3 we need to introduce suitable families of D-models $S_m$
in $(Q_\lambda, H)$ with good approximation properties with respect to the unknown $s$. More
precisely, it follows from (3.7) and (1.10) that they should provide approximations
of $\sqrt{s}$ in $L_2^+(\lambda)$. Good approximating sets for elements of $L_2^+(\lambda)$ are provided by
approximation theory and some recipes to derive D-models from such sets have been
given in Section 6 of Birgé [9]. Most results about approximation of functions in
$L_2(\lambda)$ deal with finite dimensional linear spaces or unions of such spaces and their
approximation properties with respect to different classes (typically smoothness
classes) of functions. We therefore focus here on such linear subspaces of $L_2(\lambda)$.
To translate their properties in terms of D-models, we shall invoke the following
proposition.

Proposition 2. Let $\bar{S}$ be a $k$-dimensional linear subspace of $L_2(\lambda)$ and $\delta > 0$. One
can find a subset $S'$ of $Q_\lambda$ which is a D-model in the metric space $(Q_\lambda, H)$ with
parameters $\delta$, $9k$ and 1 and such that, for any intensity $s \in L_1^+(\lambda)$,

$H(s,S') \le 2.2 \Big[ \inf_{t \in \bar{S}} \|\sqrt{s} - t\|_2 + \delta \Big].$

Proof. Let us denote by $\mathcal{B}_H$ and $\mathcal{B}_2$ the open balls in the metric spaces $(L_1^+(\lambda), H)$
and $(L_2(\lambda), d_2)$ respectively. It follows from Proposition 8 of Birgé [9] that one can
find a subset $T$ of $\bar{S}$ which is a D-model of $(L_2(\lambda), d_2)$ with parameters $\delta$, $k/2$ and
1 and such that, whatever $u \in L_2(\lambda)$, $d_2(u,T) \le d_2(u,\bar{S}) + \delta$. It follows that

(3.13) $|T \cap \mathcal{B}_2(t, 3r'\sqrt{2})| \le \exp[9k(r'/\delta)^2]$ for $r' \ge 2\delta$ and $t \in L_2(\lambda)$.

Moreover, if $t \in T$, $\pi(t) = \max\{t, 0\}$ belongs to $L_2^+(\lambda)$ and satisfies $d_2(u,\pi(t)) \le d_2(u,t)$
for any $u \in L_2^+(\lambda)$. We may therefore apply Proposition 12 of Birgé [9] with
$(M',d) = (L_2(\lambda), d_2)$, $M = L_2^+(\lambda)$, $\lambda = 1$, $\varepsilon = 1/10$, $\eta = 4\sqrt{2}\,\delta$ and $r = r'\sqrt{2}$ to
get a subset $S''$ of $\pi(T) \subset L_2^+(\lambda)$ such that

(3.14) $|S'' \cap \mathcal{B}_2(t, r'\sqrt{2})| \le |T \cap \mathcal{B}_2(t, 3r'\sqrt{2})| \vee 1$ for all $t \in L_2(\lambda)$ and $r' \ge 2\delta$

and $d_2(u,S'') \le 3.1\,d_2(u,T)$ for all $u \in L_2^+(\lambda)$. Setting $S' = \{t^2 \cdot \lambda, t \in S''\} \subset Q_\lambda$ and
using (1.10), we deduce from (3.13) and (3.14) that

$|S' \cap \mathcal{B}_H(\mu_t, r')| \le \exp[9k(r'/\delta)^2]$ for $r' \ge 2\delta$ and $\mu_t \in Q_\lambda$,

hence $S'$ is a D-model in $(Q_\lambda, H)$ with parameters $\delta$, $9k$ and 1, and

$H(s,S') \le (3.1/\sqrt{2})\,d_2(\sqrt{s},T) \le 2.2\,[d_2(\sqrt{s},\bar{S}) + \delta].$ □

We are now in a position to prove Theorem 1. For each $m$, let us fix $\eta_m^2 =
84[\Delta_m \vee (9\bar{D}_m/5)]$ and use Proposition 2 to derive from $\bar{S}_m$ a D-model $S_m$ with
parameters $\eta_m$, $D_m = 9\bar{D}_m$ and 1 which also satisfies

$H(s,S_m) \le 2.2 \Big[ \inf_{t \in \bar{S}_m} \|\sqrt{s} - t\|_2 + \eta_m \Big].$

It follows from the definition of $\eta_m$ that (3.2) and (3.3) are satisfied so that
Theorem 3 applies. The conclusion immediately follows from (3.7).



3.4.2. About the computation of T-estimators



We already mentioned that the relevance of T-estimators is mainly of a theoretical
nature because of the difficulty of their implementation. Let us give here a simple
illustrative example based on a single linear approximating space $\bar{S}$ for $\sqrt{s}$, of
dimension $k$. To try to get a practical implementation, we shall use a simple discretization
strategy. The first step is to replace $\bar{S}$, that we identify to $\mathbb{R}^k$ via the choice of a
basis, by $\theta\mathbb{Z}^k$. This provides an $\eta$-net for $\mathbb{R}^k$ with respect to the Euclidean
distance, with $\eta^2 = k(\theta/2)^2$. Let us concentrate here on the case of a large value of
$\Gamma^2 = \int s\,d\lambda$ in order to have a large number of observations since $N$ has a Poisson
distribution with parameter $\Gamma^2$. In particular, we shall assume that $\Gamma^2$ (which plays
the role of the number of observations as we shall see in Section 4.6) is much larger
than $k$. It is useless, in such a case, to use the whole of $\theta\mathbb{Z}^k$ to approximate $\sqrt{s}$ since
the closest point to $\sqrt{s}$ belongs to $\mathcal{B}(0, \Gamma + \eta)$. Of course, $\Gamma$ is unknown, but when
it is large it can be safely estimated by $\sqrt{N}$ in view of the concentration properties
of Poisson variables. Let us therefore assume that $N \ge \Gamma^2/2 \ge 2k$. A reasonable
approximating set for $\sqrt{s}$ is therefore $T = \mathcal{B}(0, \sqrt{2N} + \eta) \cap \theta\mathbb{Z}^k$ and since our final
model $S$ should be a subset of $L_2^+(\lambda)$, we can take $S = \{t \vee 0, t \in T\}$ so that
$d_2(\sqrt{s}, S) \le d_2(\sqrt{s}, T) \le d_2(\sqrt{s}, \bar{S}) + \eta$. It follows from Lemma 5 of Birgé [9] that
$|T| \le K$, where $K$ is an explicit bound involving the constant $c = \sqrt{\pi e/2} \approx 2.07$. This implies that $S$ is a D-model with parameters
$\eta$, $(\log K)/4$ and 1. In order that (3.2) be satisfied, we need that $\eta^2 \ge 4.2\log K$. If
we choose $\eta^2 = 4.2\,k\log\big(c(\sqrt{N/k} + 1)\big)$, this inequality holds since $\eta \ge 2\sqrt{k}$, hence
$K \le [c(\sqrt{N/k} + 1)]^k$. The number of tests required for building the T-estimator is
$|S|(|S| - 1) < K^2$. For $N$ of the order of 100 and $k$ as small as 5, $K^2$ is of the order
of $10^{10}$. This toy example illustrates the difficulty of implementing the algorithm.
More realistic ones would be much worse.
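
The order of magnitude quoted above can be reproduced from the bound $K \le [c(\sqrt{N/k}+1)]^k$ alone. The snippet below is a minimal sketch; the function name `pairwise_bound` is an assumption made for the illustration, and the code only evaluates the upper bound, not the exact value of $K$.

```python
import math

C = math.sqrt(math.pi * math.e / 2)  # the constant c of the text, about 2.07

def pairwise_bound(N: int, k: int):
    """Return (K_bound, K_bound**2) with K_bound = [c (sqrt(N/k) + 1)]^k.

    K_bound**2 is an upper bound for the number |S|(|S|-1) of tests.
    """
    K_bound = (C * (math.sqrt(N / k) + 1)) ** k
    return K_bound, K_bound ** 2

K_bound, tests = pairwise_bound(N=100, k=5)
print(f"K <= {K_bound:.3g}, number of tests < {tests:.3g}")
```

For $N = 100$ and $k = 5$ the bound on the number of tests lies between $10^{10}$ and $10^{11}$, in line with the order of magnitude stated in the text.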



4. Applications with linear models 

We now assume that $\mu = \mu_s = s \cdot \lambda$ and focus on the estimation of the intensity
$s$ by model selection, starting with linear models in $L_2(\lambda)$ that possess good
approximating properties with respect to $\sqrt{s}$.



4.1. Adaptation in Besov spaces

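It is now well-known that wavelet bases are very good tools for representing smooth
functions in $L_2([0,1]^l, dx)$. In particular, given a suitable wavelet basis $\{\varphi_{j,k}, j \ge -1, k \in \Lambda(j)\}$
with $|\Lambda(-1)| \le \Gamma$ and $2^{jl} \le |\Lambda(j)| \le \Gamma 2^{jl}$ for all $j \ge 0$, any function
$f \in L_2([0,1]^l, dx)$ can be written as $f = \sum_{j \ge -1} \sum_{k \in \Lambda(j)} \beta_{j,k}\varphi_{j,k}$. Moreover $f$
belongs to the Besov space $B^\alpha_{p,\infty}([0,1]^l)$ if and only if

(4.1) $\sup_{j \ge 0} 2^{j(\alpha + l/2 - l/p)} \Big( \sum_{k \in \Lambda(j)} |\beta_{j,k}|^p \Big)^{1/p} = |f|_{B^\alpha_{p,\infty}} < +\infty,$

and it belongs to $B^\alpha_{p,q}([0,1]^l)$ with $q < +\infty$ if

$\Big( \sum_{j \ge 0} \Big[ 2^{j(\alpha + l/2 - l/p)} \Big( \sum_{k \in \Lambda(j)} |\beta_{j,k}|^p \Big)^{1/p} \Big]^q \Big)^{1/q} = |f|_{B^\alpha_{p,q}} < +\infty.$
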

Many properties of those function spaces are to be found in DeVore and Lorentz,
DeVore, and Härdle, Kerkyacharian, Picard and Tsybakov, among other
references.

As a consequence of Theorem 1 we can derive an adaptation result for the
estimation of the intensity of a Poisson process when its square root belongs to some Besov
space on $[0,1]^l$.

Theorem 4. Let $X$ be a Poisson process with unknown intensity $s$ with respect
to Lebesgue measure on $[0,1]^l$. Let us assume that $\sqrt{s}$ belongs to some Besov space
$B^\alpha_{p,\infty}([0,1]^l)$ for some unknown values of $p > 0$, $\alpha > l(1/p - 1/2)_+$ and $|\sqrt{s}|_{B^\alpha_{p,\infty}}$
given by (4.1). One can build a T-estimator $\hat{s}(X)$ such that

(4.2) $\mathbb{E}_s\big[H^2(s,\hat{s})\big] \le C(\alpha,p,l) \big( |\sqrt{s}|_{B^\alpha_{p,\infty}} \vee 1 \big)^{2l/(2\alpha+l)}.$




Proof. We just use Proposition 13 of Birgé [9] which provides suitable families
$\mathcal{M}_j(2^i)$ of linear approximation spaces for functions in $B^\alpha_{p,\infty}([0,1]^l)$ and use the
family of linear spaces $\{\bar{S}_m\}_{m \in \mathcal{M}}$ with $\mathcal{M} = \bigcup_{i \ge 1} \bigcup_{j \ge 0} \mathcal{M}_j(2^i)$ provided by this
proposition. Then, for $m \in \mathcal{M}_j(2^i)$, $\bar{D}_m \le c_1(2^i) + c_2(2^i)2^{jl}$ and we choose $\Delta_m =
c_3(2^i)2^{jl}$ which implies that (1.13) holds with $\Sigma \le 1$. Applying Proposition 13
of Birgé [9] with $t = \sqrt{s}$, $r = 2^i > \alpha \ge 2^{i-1}$ and $q = 2$, we derive from Theorem 1
that, if $R = |\sqrt{s}|_{B^\alpha_{p,\infty}} \vee 1$,

$\mathbb{E}_s\big[H^2(s,\hat{s})\big] \le C \inf_{j \ge 0} \big\{ C(\alpha,p,l) R^2 2^{-2j\alpha} + c_4(\alpha) 2^{jl} \big\}.$

Choosing for $j$ the smallest integer such that $2^{j(l+2\alpha)} \ge R^2$ leads to the result. □

4.2. Anisotropic Hölder spaces

Let us recall that a function $f$ defined on $[0,1)$ belongs to the Hölder class $\mathcal{H}(\alpha,R)$
with $\alpha = \beta + p$, $p \in \mathbb{N}$, $0 < \beta \le 1$ and $R > 0$ if $f$ has a derivative of order $p$
satisfying $|f^{(p)}(x) - f^{(p)}(y)| \le R|x - y|^\beta$ for all $x, y \in [0,1)$. Given two multi-indices
$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_k)$ and $\boldsymbol{R} = (R_1, \ldots, R_k)$ in $(0,+\infty)^k$, we define the anisotropic
Hölder class $\mathcal{H}(\boldsymbol{\alpha},\boldsymbol{R})$ as the set of functions $f$ on $[0,1)^k$ such that, for each $j$
and each set of $k-1$ coordinates $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k$ the univariate function
$y \mapsto f(x_1, \ldots, x_{j-1}, y, x_{j+1}, \ldots, x_k)$ belongs to $\mathcal{H}(\alpha_j, R_j)$.

Let now a multi-integer $\boldsymbol{N} = (N_1, \ldots, N_k) \in (\mathbb{N}^*)^k$ be given. To it corresponds
the hyperrectangle $\prod_{j=1}^k [0, N_j^{-1})$ and the partition $\mathcal{I}_{\boldsymbol{N}}$ of $[0,1)^k$ into $\prod_{j=1}^k N_j$
translates of this hyperrectangle. Given an integer $r \in \mathbb{N}$ and $m = (\boldsymbol{N}, r)$ we can define
the linear space $\bar{S}_m$ of piecewise polynomials on the partition $\mathcal{I}_{\boldsymbol{N}}$ with degree at
most $r$ with respect to each variable. Its dimension is $\bar{D}_m = (r+1)^k \prod_{j=1}^k N_j$.
Setting $\mathcal{M} = (\mathbb{N}^*)^k \times \mathbb{N}$ and $\Delta_m = \bar{D}_m$, we get (1.13) with $\Sigma$ depending only on
$k$ as shown in the proof of Proposition 5, page 346 of Barron, Birgé and Massart
[7]. The same proof also implies (see (4.25), page 347) the following approximation
lemma.

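Lemma 4. Let $f \in \mathcal{H}(\boldsymbol{\alpha},\boldsymbol{R})$ with $\alpha_j = \beta_j + p_j$, $r \ge \max_{1 \le j \le k} p_j$, $\boldsymbol{N} = (N_1, \ldots, N_k) \in (\mathbb{N}^*)^k$
and $m = (\boldsymbol{N}, r)$. There exists some $g \in \bar{S}_m$ such that

$\|f - g\|_\infty \le C(k,r) \sum_{j=1}^k R_j N_j^{-\alpha_j}.$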

We are now in a position to state the following corollary of Theorem 1.

Corollary 2. Let $X$ be a Poisson process with unknown intensity $s$ with respect to
the Lebesgue measure on $[0,1)^k$ and $\hat{s}$ be a T-estimator based on the family of linear
models $\{\bar{S}_m, m \in \mathcal{M}\}$ that we have previously defined. Assume that $\sqrt{s}$ belongs to
the class $\mathcal{H}(\boldsymbol{\alpha},\boldsymbol{R})$ and set

$\bar{\alpha} = \Big( k^{-1} \sum_{j=1}^k \alpha_j^{-1} \Big)^{-1}$ and $\bar{R} = \Big( \prod_{j=1}^k R_j^{1/\alpha_j} \Big)^{\bar{\alpha}/k}.$

If $R_j \ge \bar{R}^{k/(2\bar{\alpha}+k)}$ for all $j$, then

$\mathbb{E}_s\big[H^2(s,\hat{s})\big] \le C(k,\boldsymbol{\alpha})\,\bar{R}^{2k/(2\bar{\alpha}+k)}.$
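Proof. If $\alpha_j = \beta_j + p_j$ for $1 \le j \le k$, let us set $r = \max_{1 \le j \le k} p_j$, $\eta = \bar{R}^{k/(2\bar{\alpha}+k)}$ and
define $N_j \in \mathbb{N}^*$ by $(R_j/\eta)^{1/\alpha_j} \le N_j < (R_j/\eta)^{1/\alpha_j} + 1$ so that $N_j \le 2(R_j/\eta)^{1/\alpha_j}$
for all $j$. It follows from Lemma 4 that there exists some $t \in \bar{S}_m$, $m = (\boldsymbol{N}, r)$
with $\|\sqrt{s} - t\|_\infty \le C_1(k,\boldsymbol{\alpha}) \sum_{j=1}^k R_j N_j^{-\alpha_j}$, hence $\|\sqrt{s} - t\|_2 \le k\,C_1(k,\boldsymbol{\alpha})\,\eta$. It then
follows from Theorem 1 that

$\mathbb{E}_s\big[H^2(s,\hat{s})\big] \le C_2(k,\boldsymbol{\alpha}) \Big[ \eta^2 + (r+1)^k \prod_{j=1}^k N_j \Big] \le C_3(k,\boldsymbol{\alpha}) \Big[ \eta^2 + \bar{R}^{k/\bar{\alpha}} \eta^{-k/\bar{\alpha}} \Big].$

The conclusion follows. □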
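
The choice of the partition in the proof can be made concrete as follows. This is a sketch under stated assumptions: the function name `anisotropic_design` is hypothetical, $\bar\alpha$ and $\bar R$ are taken as the harmonic-type mean of the $\alpha_j$ and the matching geometric-type mean of the $R_j$ (the quantities for which $\prod_j (R_j/\eta)^{1/\alpha_j} = \bar R^{k/\bar\alpha}\eta^{-k/\bar\alpha}$), and a small tolerance guards the ceiling against rounding.

```python
import math

def anisotropic_design(alphas, radii):
    """Return (alpha_bar, R_bar, eta, N, r, dim) for the proof's model m = (N, r)."""
    k = len(alphas)
    alpha_bar = k / sum(1.0 / a for a in alphas)
    R_bar = math.prod(R ** (1.0 / a) for a, R in zip(alphas, radii)) ** (alpha_bar / k)
    eta = R_bar ** (k / (2 * alpha_bar + k))
    if any(R < eta * (1 - 1e-12) for R in radii):
        raise ValueError("the condition R_j >= eta fails for some j")
    # Smallest integer N_j >= (R_j / eta)^(1/alpha_j); the tolerance absorbs rounding.
    N = [max(1, math.ceil((R / eta) ** (1.0 / a) - 1e-12)) for a, R in zip(alphas, radii)]
    # alpha_j = beta_j + p_j with 0 < beta_j <= 1, so p_j = ceil(alpha_j) - 1.
    r = max(math.ceil(a) - 1 for a in alphas)
    dim = (r + 1) ** k * math.prod(N)  # dimension of the piecewise polynomial space
    return alpha_bar, R_bar, eta, N, r, dim

# Hypothetical example: k = 2, alpha = (1, 1), R = (4, 4).
print(anisotropic_design([1.0, 1.0], [4.0, 4.0]))
```

In this example $\eta^2$ and the dimension are of the same order, which is the balance between approximation error and model dimension that the proof exploits.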



4.3. Intensities with bounded $\alpha$-variation

Let us first recall that a function $f$ defined on some interval $J \subset \mathbb{R}$ has bounded
$\alpha$-variation on $J$ for some $\alpha \in (0,1]$ if

(4.3) $\sup_{i \ge 1}\ \sup_{x_0 < \cdots < x_i,\ x_j \in J \text{ for } 0 \le j \le i}\ \sum_{j=1}^i |f(x_j) - f(x_{j-1})|^{1/\alpha} = [V_\alpha(f;J)]^{1/\alpha} < +\infty,$

the classical case of bounded variation corresponding to $\alpha = 1$. This formulation
using the power $1/\alpha$ (instead of $\alpha$) implies that an $\alpha$-Hölderian function has bounded
$\alpha$-variation over any finite interval $J$. We want to build a family of linear models
which are suitable for estimating intensities $s$ with support on some interval $J$ of
finite length $L$ and such that $\sqrt{s}$ has bounded $\alpha$-variation on $J$ for some unknown
value of $\alpha$. These models are linear spaces of piecewise constant functions on some
finite partitions $m$ of $J$, namely

$\bar{S}_m = \Big\{ t = \sum_{j=1}^D a_j 1_{I_j} \Big\}$ when $m = \{I_1, \ldots, I_D\}$.



We consider for $\mathcal{M}$ a special family of partitions $m$ of $J$ derived by dyadic
splitting which are in one-to-one correspondence with the family of complete binary
trees. They are built according to the following "adaptive" algorithm described in
Section 3.3 of DeVore [17]. This algorithm simultaneously grows a complete binary
tree and a dyadic partition of J. It starts with a tree reduced to its root which 
is associated to the interval J. At each step of the algorithm the set of terminal 
nodes of the current tree is associated to the set of intervals in the current partition. 
Each step of the algorithm corresponds to choosing one terminal node and adding 
two sons to it. For the associated partition this means dividing the interval which 
corresponds to this terminal node into two intervals of equal length which then 
correspond to the two sons. At some stage the procedure stops and we end with a 
complete binary tree with D terminal nodes and the associated partition of J into 
$D$ intervals. We actually take for $\mathcal{M}$ the set of all finite partitions $m$ that can be
built in that way so that each $m$ corresponds to the complete binary tree with $|m|$
terminal nodes that was used to build the partition.
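
The growth of a dyadic partition described above can be sketched in a few lines. This is only an illustration: the function name `grow_dyadic_partition`, the splitting rule passed as `should_split` and the cap `max_leaves` are assumptions made for the example; the text itself leaves the choice of the terminal node to split unspecified.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # half-open interval [a, b)

def grow_dyadic_partition(root: Interval,
                          should_split: Callable[[Interval], bool],
                          max_leaves: int = 1024) -> List[Interval]:
    """Grow a dyadic partition of `root` by repeatedly halving terminal intervals.

    The list of leaves plays the role of the terminal nodes of the binary tree.
    """
    leaves = [root]
    changed = True
    while changed and len(leaves) < max_leaves:
        changed = False
        for idx, (a, b) in enumerate(leaves):
            if should_split((a, b)):
                mid = (a + b) / 2
                leaves[idx:idx + 1] = [(a, mid), (mid, b)]  # add two sons
                changed = True
                break
    return leaves

# Hypothetical rule: refine around the point 0.3 down to length 1/8.
def refine_near_point(interval: Interval) -> bool:
    a, b = interval
    return a <= 0.3 < b and (b - a) > 0.125

print(grow_dyadic_partition((0.0, 1.0), refine_near_point))
```

Each call to the splitting rule corresponds to one inspection of a terminal node, and the returned list is the partition $m$ associated with the final tree.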

It is known that the number of complete binary trees with $j+1$ terminal nodes
is given by the so-called Catalan numbers $(1+j)^{-1}\binom{2j}{j} \le 4^j/(1+j)$ as explained
for instance in Stanley [33], page 172. Setting $\Delta_m = 2|m|$ leads to

$\sum_{m \in \mathcal{M}} \exp[-\Delta_m] = \sum_{j \ge 0}\ \sum_{\{m \in \mathcal{M} \,|\, |m| = j+1\}} \exp[-2(j+1)]$

(4.4) $\le \sum_{j \ge 0} 4^j \exp[-2(j+1)] = e^{-2} \sum_{j \ge 0} (2/e)^{2j} < 1.$
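
The counting argument behind (4.4) is easy to check numerically. The sketch below is an illustration only; the function names `catalan` and `weight_sum` are assumptions made for the example.

```python
import math

def catalan(j: int) -> int:
    """Number of complete binary trees with j + 1 terminal nodes."""
    return math.comb(2 * j, j) // (j + 1)

def weight_sum(j_max: int) -> float:
    """Partial sum over |m| = j + 1 <= j_max + 1 of exp(-Delta_m), with Delta_m = 2|m|."""
    return sum(catalan(j) * math.exp(-2 * (j + 1)) for j in range(j_max + 1))

print([catalan(j) for j in range(6)])  # first Catalan numbers
print(weight_sum(200))                 # partial sum of the left-hand side of (4.4)
```

The partial sums stay below the geometric bound $e^{-2}\sum_{j \ge 0}(2/e)^{2j} = 1/(e^2 - 4)$.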

The approximation properties of $\bigcup_{m \in \mathcal{M}} \bar{S}_m$ with respect to functions of bounded
$\alpha$-variation are given by the following proposition the proof of which was kindly
communicated to the author by Ron DeVore.

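Proposition 3. Let $f$ be a function of bounded $\alpha$-variation on the interval $J$ of
finite length $L$ with $\alpha$-variation $V_\alpha(f;J)$ given by (4.3). For each $j \in \mathbb{N}$, one can
find a partition $m \in \mathcal{M}$ with

(4.5) $|m| \le c_1(\alpha) 2^j$ and $\inf_{t \in \bar{S}_m} \|f - t\|_2 \le c_2(\alpha) L^{1/2} V_\alpha(f;J)\, 2^{-j\alpha},$

with $1 < c_1(\alpha) = \big(1 - 2^{-[1/(2\alpha)+1]}\big)\big(1 - 2^{-1/(2\alpha)}\big)^{-1} < 2.21$ and

$\sqrt{2} < c_2(\alpha) = \left( \dfrac{2^{1+2\alpha} \big(1 - 2^{-[1/(2\alpha)+1]}\big)^{1-2\alpha}}{1 - 2^{-1/(2\alpha)}} \right)^{1/2} < 6.51.$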
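
The numerical ranges claimed for the two constants can be checked on a grid of values of $\alpha$. This is a sketch only: the function names `c1` and `c2` are chosen for the illustration, and the formula used for $c_2$ is the one obtained by combining the bounds on $\varepsilon$ and $|m|$ in the proof below.

```python
import math

def c1(alpha: float) -> float:
    """Constant in the bound |m| <= c1(alpha) 2^j."""
    x = 1.0 / (2.0 * alpha)
    return (1 - 2 ** -(x + 1)) / (1 - 2 ** -x)

def c2(alpha: float) -> float:
    """Constant in the bound on the L2 approximation error."""
    x = 1.0 / (2.0 * alpha)
    num = 2 ** (1 + 2 * alpha) * (1 - 2 ** -(x + 1)) ** (1 - 2 * alpha)
    return math.sqrt(num / (1 - 2 ** -x))

for alpha in (0.25, 0.5, 1.0):
    print(alpha, round(c1(alpha), 4), round(c2(alpha), 4))
```

Both constants are largest at $\alpha = 1$, the case of bounded variation.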



Proof. For any interval $I \subset J$ we denote by $|I|$ its length and set $V(I) = V_\alpha(f;I)$.
If $m = \{I_1, \ldots, I_D\}$ is a partition of $J$ into $D$ intervals, $\bar{f}_j = |I_j|^{-1} \int_{I_j} f(x)\,dx$ and
$\bar{f} = \sum_{j=1}^D \bar{f}_j 1_{I_j}$, then $\|(f - \bar{f}_j) 1_{I_j}\|_\infty \le V(I_j)$, hence

(4.6) $\|f - \bar{f}\|_2^2 \le \sum_{j=1}^D E(I_j)$ with $E(I) = |I| V^2(I).$

In particular (4.5) holds with $m = \{J\}$ and $j = 0$. To study the general case we
choose some $\varepsilon > 0$ and apply the adaptive algorithm described just before in the
following way: at each step we inspect the intervals of the partition and if we find
an interval $I$ with $E(I) > \varepsilon$ we divide it into two intervals of equal length $|I|/2$.
The algorithm necessarily stops since $E(I) \le |I| V^2(J)$ for all $I \subset J$ and this results
in some partition $m$ with $E(I) \le \varepsilon$ for all $I \in m$. It follows from (4.6) that if $\bar{f}$ is
built on this partition, then $\|f - \bar{f}\|_2^2 \le \varepsilon |m|$. Since the case $|m| = 1$ has already
been considered, we may assume that $|m| \ge 2$. Let us denote by $D_k$ the number
of intervals in $m$ with length $L 2^{-k}$ and set $a_k = 2^{-k} D_k$ so that $\sum_{k \ge 1} a_k = 1$
(since $D_0 = 0$). If $I$ is an interval of length $L 2^{-k}$, $k > 0$, it derives from the
splitting of an interval $I'$ with length $L 2^{-k+1}$ such that $E(I') > \varepsilon$, hence, by (4.6),
$V(I') > [\varepsilon L^{-1} 2^{k-1}]^{1/2}$ and, since the set function $V^{1/\alpha}$ is superadditive over disjoint
intervals (the sum of its values over disjoint subintervals of $J$ is at most $[V(J)]^{1/\alpha}$),
the number of such intervals $I'$ is bounded by $[V(J)]^{1/\alpha} [\varepsilon L^{-1} 2^{k-1}]^{-1/(2\alpha)}$.
It follows that

$D_k \le \gamma 2^{-k/(2\alpha)}$ and $a_k \le \gamma 2^{-k[1/(2\alpha)+1]}$ with $\gamma = 2[V(J)]^{1/\alpha} [\varepsilon/(2L)]^{-1/(2\alpha)}.$

Since $|m| = \sum_{k \ge 1} 2^k a_k$, we can derive a bound on $|m|$ from a maximization of
$\sum_{k \ge 1} 2^k a_k$ under the restrictions $\sum_{k \ge 1} a_k = 1$ and $a_k \le \gamma 2^{-k[1/(2\alpha)+1]}$. One should
then clearly keep the largest possible indices $k$ with the largest possible values for
$a_k$. Let us fix $\varepsilon$ so that $\gamma = \big(1 - 2^{-[1/(2\alpha)+1]}\big) 2^{j[1/(2\alpha)+1]}$ for some $j \ge 1$. Then,
setting $a_k$ to its maximal value for $k \ge j$, we get $\sum_{k \ge j} \gamma 2^{-k[1/(2\alpha)+1]} = 1$, which implies that
an upper bound for $|m|$ is

$\sum_{k \ge j} \gamma 2^{-k/(2\alpha)} = \dfrac{\gamma 2^{-j/(2\alpha)}}{1 - 2^{-1/(2\alpha)}} = \dfrac{1 - 2^{-[1/(2\alpha)+1]}}{1 - 2^{-1/(2\alpha)}}\, 2^j.$

The corresponding value of $\varepsilon$ is $2L(\gamma/2)^{-2\alpha} V^2(J)$ so that

$\|f - \bar{f}\|_2^2 \le \varepsilon |m| \le 2L V^2(J)\, 2^{2\alpha}\, \dfrac{\big(1 - 2^{-[1/(2\alpha)+1]}\big)^{1-2\alpha}}{1 - 2^{-1/(2\alpha)}}\, 2^{-2\alpha j}.$

These two bounds give (4.5) and we finally use the fact that $0 < \alpha \le 1$ to bound
the two constants. □



We can then derive from this proposition, (1.15) and our choice of the $\Delta_m$ that

$\mathbb{E}_s\big[H^q(s,\hat{s})\big] \le C(q) \inf_{j \in \mathbb{N}} \big[ 2^{j/2} + L^{1/2} V_\alpha(\sqrt{s};J)\, 2^{-j\alpha} \big]^q.$

An optimization with respect to $j \in \mathbb{N}$ then leads to the following risk bound.

Corollary 3. Let $X$ be a Poisson process with unknown intensity $s$ with respect
to the Lebesgue measure on some interval $J$ of length $L$. We assume that $\sqrt{s}$ has
finite $\alpha$-variation equal to $V$ on $J$, both $\alpha$ and $V$ being unknown. One can build a
T-estimator $\hat{s}(X)$ such that

(4.7) $\mathbb{E}_s\big[H^q(s,\hat{s})\big] \le C(q) \big[ (L^{1/2} V) \vee 1 \big]^{q/(2\alpha+1)}.$

It is not difficult to show, using Assouad's Lemma, that, up to a constant, this
bound is optimal when $q = 2$.

Proposition 4. Let $L$, $\alpha$ and $V$ be given and $S \subset L_1^+(\lambda)$ be the set of intensities
with respect to the Lebesgue measure on $[0,L)$ such that $\sqrt{s}$ has $\alpha$-variation bounded
by $V$. Let $\hat{s}(X)$ be any estimator based on a Poisson process $X$ with unknown
intensity $s \in S$. There exists a universal constant $c > 0$ (independent of $s$, $L$, $\alpha$ and
$V$) such that

$\sup_{s \in S} \mathbb{E}_s\big[H^2(s,\hat{s})\big] \ge c \big[ (L^{1/2} V) \vee 1 \big]^{2/(2\alpha+1)}.$



Proof. If $L^{1/2} V < 1$, we simply apply (2.6) with $s_0 = 1_{[0,L)}$ and $s_1 = (1 +
L^{-1/2})^2 1_{[0,L)}$ so that $2H^2(s_0, s_1) = 1$. If $L = 1$ and $V \ge 1$ we fix some positive
integer $D$ and define $g$ with support on $[0, D^{-1})$ by

$g(x) = x\,1_{[0,(2D)^{-1})}(x) + (D^{-1} - x)\,1_{[(2D)^{-1},D^{-1})}(x).$

Then $\int_0^{1/D} g^2(x)\,dx = (12D^3)^{-1}$ and $0 \le g(x) \le (2D)^{-1}$. If we apply the
construction of Lemma 2 we get a family of Lipschitz intensities $s_\delta$ with values in the
interval $[12D^3 - 3D^2, 12D^3 + 3D^2] \subset [9D^3, 15D^3]$ and Lipschitz coefficient $6D^3$. It
follows that if $0 \le x < y \le 1$,

$\big| \sqrt{s_\delta(x)} - \sqrt{s_\delta(y)} \big| = \dfrac{|s_\delta(x) - s_\delta(y)|}{\sqrt{s_\delta(x)} + \sqrt{s_\delta(y)}} \le \dfrac{(6D^2) \wedge (6D^3 |x - y|)}{6D^{3/2}} = D^{1/2} \wedge \big( D^{3/2} |x - y| \big).$

This allows us to bound the $\alpha$-variation of $\sqrt{s_\delta}$ in the following way. For any
increasing sequence $0 \le x_0 < \cdots < x_i \le 1$,

$\sum_{j=1}^i \big| \sqrt{s_\delta(x_j)} - \sqrt{s_\delta(x_{j-1})} \big|^{1/\alpha} \le D^{1/(2\alpha)} \sum_{j=1}^i 1_{\{x_j - x_{j-1} > D^{-1}\}} + D^{3/(2\alpha)} \sum_{j=1}^i (x_j - x_{j-1})^{1/\alpha} 1_{\{x_j - x_{j-1} \le D^{-1}\}}.$

If $n = \sum_{j=1}^i 1_{\{x_j - x_{j-1} > D^{-1}\}} < D$ then, by convexity,

$D^{3/(2\alpha)} \sum_{j=1}^i (x_j - x_{j-1})^{1/\alpha} 1_{\{x_j - x_{j-1} \le D^{-1}\}} \le D^{3/(2\alpha)} D^{-1/\alpha} (D - n) = D^{1/(2\alpha)} (D - n),$

which shows that the $\alpha$-variation of $\sqrt{s_\delta}$ is bounded by $[D^{1/(2\alpha)} D]^\alpha = D^{(1+2\alpha)/2}$.
We finally choose for $D$ the largest integer $j$ such that $j^{(1+2\alpha)/2} \le V$. Then
$V^{2/(1+2\alpha)} < 2D$ and an application of Lemmas 1 and 2 shows that

$\sup_{s \in S} \mathbb{E}_s\big[H^2(s,\hat{s})\big] \ge 2^{-8} (2D) \exp[-2/7] \ge 2^{-8} \exp[-2/7]\, V^{2/(1+2\alpha)},$

which proves our lower bound. The general case $L^{1/2} V \ge 1$ follows from a scaling
argument. If $X$ is a Poisson process on $[0,L)$ with intensity $s$ (with respect to the
Lebesgue measure), then $Y = L^{-1} X$ is a Poisson process on $[0,1]$ with intensity $s_L$
to which the previous results apply. Since $s_L(y) = L s(Ly)$, it follows that $H^2(s,t) =
H^2(s_L, t_L)$ and, if $\sqrt{s}$ has $\alpha$-variation bounded by $V$, $\sqrt{s_L}$ has $\alpha$-variation bounded
by $L^{1/2} V$. The result for an arbitrary $L$ follows from these remarks. □



4.4. Intensities with square roots in weak $\ell_q$-spaces

4.4.1. Approximation based on weak $\ell_q$-spaces

As we already mentioned, if $s \in L_1^+(\lambda)$ is an intensity with respect to $\lambda$ on $\mathcal{X}$
and we are given an orthonormal basis $\{\varphi_j, j \ge 1\}$ of $L_2(\lambda)$, $\sqrt{s}$ can be written as
$\sum_{j \ge 1} \beta_j \varphi_j$ with $\beta = (\beta_j)_{j \ge 1} \in \ell_2 = \ell_2(\mathbb{N}^*)$ and $\sum_{j \ge 1} \beta_j^2 = \|\sqrt{s}\|_2^2 < +\infty$. Hence,
for all $x > 0$, $|\{j \ge 1 \,|\, |\beta_j| \ge x\}| \le \|\sqrt{s}\|_2^2 x^{-2}$, which means that the sequence $\beta$
belongs to the weak $\ell_2$-space $\ell_2^w$.

More generally, given a sequence $\beta = (\beta_j)_{j \ge 1}$ converging to zero and $a_j$ the
rearrangement of the numbers $|\beta_j|$ in nonincreasing order (which means that $a_1 =
\sup_{j \ge 1} |\beta_j|$, etc.), we say that $\beta$ belongs to the weak $\ell_q$-space $\ell_q^w$ ($q > 0$) if

(4.8) $\sup_{x > 0} x^q |\{j \ge 1 \,|\, |\beta_j| \ge x\}| = \sup_{x > 0} x^q |\{j \ge 1 \,|\, a_j \ge x\}| = |\beta|_{q,w}^q < +\infty.$

This implies that $a_j \le |\beta|_{q,w}\, j^{-1/q}$ for $j \ge 1$ and the reciprocal actually holds:

(4.9) $|\beta|_{q,w} = \inf\{ y > 0 \,|\, a_j \le y j^{-1/q} \text{ for all } j \ge 1 \}.$

Note that, although $|\theta\beta|_{q,w} = |\theta| |\beta|_{q,w}$ for $\theta \in \mathbb{R}$, $|\beta|_{q,w}$ is not a norm. For
convenience, we shall call it the weight of $\beta$ in $\ell_q^w$. By extension, given the basis
$\{\varphi_j, j \ge 1\}$, we shall say that $u \in L_2(\lambda)$ belongs to $\ell_q^w$ if $u = \sum_{j \ge 1} \beta_j \varphi_j$ and
$\beta \in \ell_q^w$. As a consequence of this control on the size of the coefficients $a_j$, we get
the following useful lemma.

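Lemma 5. Let $\beta \in \ell_q^w$ with weight $|\beta|_{q,w}$ for some $q > 0$ and $(a_j)_{j \ge 1}$ be the
nonincreasing rearrangement of the numbers $|\beta_j|$. Then $\beta \in \ell_p$ for $p > q$ and for
all $n \ge 1$,

(4.10) $\sum_{j > n} a_j^p \le \dfrac{q}{p - q}\, |\beta|_{q,w}^p\, (n + 1/2)^{-(p-q)/q}.$

Proof. By (4.9) and convexity,

$\sum_{j > n} a_j^p \le |\beta|_{q,w}^p \sum_{j > n} j^{-p/q} \le |\beta|_{q,w}^p \int_{n+1/2}^{+\infty} x^{-p/q}\,dx.$ □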
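
Inequality (4.10) can be illustrated on a finite sequence, for which the weight $|\beta|_{q,w}$ equals $\max_j a_j j^{1/q}$ by (4.9). The code below is a sketch only: the function names and the test sequence $\beta_j = (-1)^j j^{-1/q}$ are assumptions chosen for the illustration.

```python
def sorted_moduli(beta):
    """Nonincreasing rearrangement a_1 >= a_2 >= ... of the numbers |beta_j|."""
    return sorted((abs(b) for b in beta), reverse=True)

def weak_weight(beta, q: float) -> float:
    """Weight |beta|_{q,w} of a finite sequence, computed through (4.9)."""
    return max(a * j ** (1.0 / q) for j, a in enumerate(sorted_moduli(beta), start=1))

def tail_sum(beta, n: int, p: float) -> float:
    """Sum of a_j^p over j > n."""
    return sum(a ** p for a in sorted_moduli(beta)[n:])

def lemma5_bound(weight: float, q: float, p: float, n: int) -> float:
    """Right-hand side of (4.10)."""
    return q / (p - q) * weight ** p * (n + 0.5) ** (-(p - q) / q)

q, p = 1.0, 2.0
beta = [(-1) ** j * j ** (-1.0 / q) for j in range(1, 10001)]
w = weak_weight(beta, q)
for n in (1, 10, 100, 1000):
    print(n, tail_sum(beta, n, p), lemma5_bound(w, q, p, n))
```

With $p = 2$ and $n = D$ the same bound gives the approximation rate $(D + 1/2)^{1-2/q}$ for the best $D$-term approximation.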

As explained in great detail in Kerkyacharian and Picard [23] and Cohen, DeVore,
Kerkyacharian and Picard [15], the fact that $u \in \ell_q^w$ for some $q < 2$ has important
consequences for the approximation of $u$ by functions in suitable $D$-dimensional
spaces. For $m$ any finite subset of $\mathbb{N}^*$, let us define $\bar{S}_m$ as the linear span of $\{\varphi_j, j \in m\}$.
If $u = \sum_{j \ge 1} \beta_j \varphi_j$ belongs to $\ell_q^w$ and $D$ is a positive integer, one can find some
$m$ with $|m| = D$ and some $t \in \bar{S}_m$ such that

(4.11) $\|u - t\|_2^2 \le (2/q - 1)^{-1} |\beta|_{q,w}^2 (D + 1/2)^{1-2/q}.$

Indeed, let us take for $m$ the set of indices of the $D$ largest numbers $|\beta_j|$. It follows
from (4.10) that

$\sum_{j \notin m} \beta_j^2 = \sum_{j > D} a_j^2 \le (2/q - 1)^{-1} |\beta|_{q,w}^2 (D + 1/2)^{1-2/q}.$

Setting $t = \sum_{j \in m} \beta_j \varphi_j$ gives (4.11) which provides the rate of approximation of
$u$ by functions of the set $\bigcup_{\{m \,|\, |m| = D\}} \bar{S}_m$ as a decreasing function of $D$ (which
is not possible for $q = 2$). Unfortunately, this involves an infinite family of linear
spaces $\bar{S}_m$ of dimension $D$ since the largest coefficients of the sequence $\beta$ may have
arbitrarily large indices. To derive a useful, as well as a practical approximation
method for functions in $\ell_q^w$-spaces, one has to restrict to those sets $m$ which are
subsets of $\{1, \ldots, n\}$ for some given value of $n$. This is what is done in Kerkyacharian
and Picard [23] who show, in their Corollary 3.1, that a suitable thresholding of
empirical versions of the coefficients $\beta_j$ for $j \in \{1, \ldots, n\}$ leads to estimators that
have nice properties. Of course, since this approach ignores the (possibly large)
coefficients with indices bigger than $n$, an additional condition on $\beta$ is required to
control $\sum_{j > n} \beta_j^2$. In Kerkyacharian and Picard [23], it takes the form

(4.12) $\sum_{j > n} \beta_j^2 \le A^2 n^{-\delta}$ for all $n \ge 1$, with $A$ and $\delta > 0$,

while Cohen, DeVore, Kerkyacharian and Picard [15], page 178, use the similar
condition BS. Such a condition is always satisfied for functions in Besov spaces
$B^\alpha_{p,\infty}([0,1]^l)$ with $p \le 2$ and $\alpha > l(1/p - 1/2)$. Indeed, if

$f = \sum_{j=-1}^{\infty} \sum_{k \in \Lambda(j)} \beta_{j,k} \varphi_{j,k}$

belongs to such a Besov space, it follows from (4.1) that,

(4.13) $\sum_{j > J} \sum_{k \in \Lambda(j)} |\beta_{j,k}|^2 \le \sum_{j > J} \Big( \sum_{k \in \Lambda(j)} |\beta_{j,k}|^p \Big)^{2/p} \le C |f|_{B^\alpha_{p,\infty}}^2\, 2^{-2J(\alpha + l/2 - l/p)}.$

Since the number of coefficients $\beta_{j,k}$ with $j \le J$ is bounded by $C 2^{Jl}$, after a proper
change in the indexing of the coefficients, the corresponding sequence $\beta$ will satisfy
$\sum_{j > n} \beta_j^2 \le A^2 n^{-\delta}$ with $\delta = (2\alpha/l) + 1 - (2/p)$.

4.4.2. Model selection for weak $\ell_q$-spaces

It is the very method of thresholding that imposes to fix the value of $n$ as a function
of $\delta$ or impose the value of $\delta$ when $n$ has been chosen in order to get a good
performance for the threshold estimators. Model selection is more flexible since it
allows one to adapt the value of $n$ to the unknown values of $A$ and $\delta$. Let us assume that
an orthonormal basis $\{\varphi_j, j \ge 1\}$ for $L_2(\lambda)$ has been chosen and that the Poisson
process $X$ has an intensity $s$ with respect to $\lambda$ so that $\sqrt{s} = \sum_{j \ge 1} \beta_j \varphi_j$ with $\beta \in \ell_2$.
We take for $\mathcal{M}$ the set of all subsets $m$ of $\mathbb{N}^*$ such that $|m| = 2^j$ for some $j \in \mathbb{N}$ and
choose for $\bar{S}_m$ the linear span of $\{\varphi_j, j \in m\}$ with dimension $\bar{D}_m = |m|$. If $|m| = 2^j$
and $k = \inf\{i \in \mathbb{N}^* \,|\, 2^i \ge l \text{ for all } l \in m\}$, we set $\Delta_m = k + \log\binom{2^k}{2^j}$. Then

$\sum_{m \in \mathcal{M}} \exp[-\Delta_m] \le \sum_{k \ge 1} \sum_{j=0}^{k} \binom{2^k}{2^j} \exp\Big[ -k - \log\binom{2^k}{2^j} \Big] = \sum_{k \ge 1} (k+1) \exp[-k],$

which allows one to apply Theorem 1.
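
The weights $\Delta_m$ are simple to compute for a given index set. The following is a minimal sketch; the function names `model_weight` and `weight_series` are assumptions made for the illustration.

```python
import math

def model_weight(m) -> float:
    """Delta_m = k + log C(2^k, 2^j) for a set m of positive integers with |m| = 2^j."""
    size = len(m)
    if size == 0 or size & (size - 1) != 0:
        raise ValueError("|m| must be a power of two")
    if min(m) < 1:
        raise ValueError("m must be a subset of the positive integers")
    # Smallest integer k >= 1 such that 2^k >= max(m).
    k = max(1, (max(m) - 1).bit_length())
    return k + math.log(math.comb(2 ** k, size))

def weight_series(k_max: int) -> float:
    """Partial sum of the bounding series sum_{k>=1} (k + 1) exp(-k)."""
    return sum((k + 1) * math.exp(-k) for k in range(1, k_max + 1))

print(model_weight({1, 2}), model_weight({1, 5}))
print(weight_series(60))
```

The series converges to $e/(e-1)^2 + 1/(e-1)$, roughly 1.5, so the sum of the $\exp[-\Delta_m]$ is finite.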

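Proposition 5. Let $\hat{s}$ be a T-estimator provided by Theorem 1 and based on the
previous family of models $\bar{S}_m$ and weights $\Delta_m$. If $\sqrt{s} = \sum_{j \ge 1} \beta_j \varphi_j$ with $\beta \in \ell_q^w$
for some $q < 2$ and (4.12) holds with $A \ge 1$ and $0 < \delta \le 1$, the risk of $\hat{s}$ at $s$ is
bounded by

$\mathbb{E}_s\big[H^2(s,\hat{s})\big] \le C \Big[ \big( \gamma^{1-q/2} (R^2 \vee \gamma)^{q/2} \big) \wedge A^{2/(1+\delta)} \Big]$

with

$R = (2/q - 1)^{-1/2} |\beta|_{q,w}$ and $\gamma = \delta^{-1} \left[ \dfrac{\log\big(\delta [A \vee R]^2\big)}{\log 2} \vee 1 \right].$
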

Proof. Let $(a_j)_{j \ge 1}$ be the nonincreasing rearrangement of the numbers $|\beta_j|$, $k$ and
$j \le k$ be given and $m$ be the set of indices of the $2^j$ largest coefficients among
$\{|\beta_1|, \ldots, |\beta_{2^k}|\}$. Then $\bar{D}_m = 2^j$ and $\Delta_m \le k + \log\binom{2^k}{2^j}$. It follows from (4.10) and
(4.12) that

$\sum_{i \notin m} \beta_i^2 \le \Big( \sum_{i > 2^j} a_i^2 \Big) 1_{j<k} + \sum_{i > 2^k} \beta_i^2 \le R^2 2^{-j(2/q-1)} 1_{j<k} + A^2 2^{-k\delta}.$

This shows that one can find $t \in \bar{S}_m$ such that $\|\sqrt{s} - t\|_2^2 \le R^2 2^{-j(2/q-1)} 1_{j<k} +
A^2 2^{-k\delta}$ and it follows from (1.14) that

$\mathbb{E}_s\big[H^2(s,\hat{s})\big] \le C \inf_{k \ge 1} \inf_{0 \le j \le k} \Big\{ R^2 2^{-j(2/q-1)} 1_{j<k} + A^2 2^{-k\delta} + 2^j + k + \log\binom{2^k}{2^j} \Big\}.$

We recall that $C$ denotes a constant that may change as often as necessary. If
$j = k$, $\mathbb{E}_s[H^2(s,\hat{s})] \le C[A^2 2^{-k\delta} + 2^k]$ and an optimization with respect to $k$ leads
to $\mathbb{E}_s[H^2(s,\hat{s})] \le C A^{2/(1+\delta)}$. For $j < k$, we notice that $\Delta_m \le k + 2^j[1 + \log(2^{k-j})] \le
3k2^j$, so that

(4.14) $\mathbb{E}_s\big[H^2(s,\hat{s})\big] \le C \inf_{k \ge 1} \Big\{ (A^2 2^{-k\delta}) \vee \inf_{0 \le j < k} \big\{ (R^2 2^{-j(2/q-1)}) \vee (k 2^j) \big\} \Big\}.$

If $R^2 2^{-(k-1)(2/q-1)} > k 2^{k-1}$, we may harmlessly increase $k$ until $k = K$ with

$K = \inf\big\{ i \ge 1 \,\big|\, i 2^{i-1} \ge R^2 2^{-(i-1)(2/q-1)} \big\} = \inf\big\{ i \ge 1 \,\big|\, 2^{i-1} \ge R^q i^{-q/2} \big\}$

and therefore restrict the minimization in (4.14) to $k \ge K$. We then choose for $j$
the smallest integer $i$ such that $2^i \ge (R^2/k)^{q/2}$, which leads to

$\mathbb{E}_s\big[H^2(s,\hat{s})\big] \le C \inf_{k \ge K} \big\{ (A^2 2^{-k\delta}) \vee (R^q k^{1-q/2}) \vee k \big\}.$

It follows from Lemma 6 below (with $a = 1$) that, if $\delta A^2 \le 2$, $(A^2 2^{-k\delta}) \vee k \ge A^2/2$
for all $k$ which does not improve on our previous bound $C A^{2/(1+\delta)}$ so that we may
assume from now on that $\delta A^2 > 2$, hence $\gamma > \delta^{-1}$. Handling this case in full
generality is much more delicate and we shall simplify the minimization problem
by replacing $A$ by $\bar{A} = A \vee R$, which amounts to assuming that $A \ge R$ and leads
to $\mathbb{E}_s[H^2(s,\hat{s})] \le C \inf_{k \ge K} f(k)$ with

$f(x) = f_1(x) \vee f_2(x) \vee x$; $f_1(x) = \bar{A}^2 2^{-x\delta}$ and $f_2(x) = R^q x^{1-q/2}.$

We want to minimize $f(x)$, up to constants. The minimization of $f_1(x) \vee x$ follows
from Lemma 6 with $\delta \bar{A}^2 > 2$. The minimum then takes the form $c_2\gamma > 0.469\gamma$
with $f_1(\gamma) = \delta^{-1} < \gamma$ hence $f(\gamma) = \gamma \vee f_2(\gamma)$. To bound $\inf_x f(x)$ from below in terms of $f(\gamma)$
when $\delta \bar{A}^2 > 2$, we distinguish between two cases. If $R^2 \le \gamma$, $f(\gamma) = \gamma$ and we
conclude from the fact that $\inf_x f(x) > 0.469\gamma$. If $R^2 > \gamma$, $f_2(x) > x$ for $x \le \gamma$,
$f(\gamma) = f_2(\gamma) > \gamma$ and the minimum of $f(x)$ is obtained for some $x_0 < \gamma$. Hence

$\inf_x f(x) = \inf_x \{ f_1(x) \vee f_2(x) \} = R^q \inf_x \big\{ (B 2^{-\delta x}) \vee x^{1-q/2} \big\}$ with $B = \bar{A}^2 R^{-q}$.

It follows from Lemma 6 with $a = (2-q)/2$ that the result of this minimization
depends on the value of

$V = \dfrac{2\delta}{2-q}\, \bar{A}^{4/(2-q)} R^{-2q/(2-q)} = \dfrac{2\bar{A}^2\delta}{2-q} \Big( \dfrac{\bar{A}}{R} \Big)^{2q/(2-q)} > \bar{A}^2 \delta > 2,$

since $\bar{A} \ge R$. Then, using $\log V > \log(\delta \bar{A}^2)$,

$\inf_x f(x) \ge R^q \left[ 0.469\, \dfrac{(2-q) \log V}{2\delta \log 2} \right]^{1-q/2} \ge R^q \left[ 0.469\, \dfrac{2-q}{2} \right]^{1-q/2} \gamma^{1-q/2} \ge 0.45\, R^q \gamma^{1-q/2},$

and we can conclude that, in both cases, $\inf_x f(x) \ge 0.45 f(\gamma)$. Let us now fix $k$ such
that $\gamma + 1 \le k < \gamma + 2$ so that $k < 3\gamma$. Then $2^{k-1} \ge 2^\gamma = (\bar{A}^2\delta)^{1/\delta}$ while $R^q k^{-q/2} \le
(R^2/\gamma)^{q/2} \le (R^2\delta)^{q/2}$. This implies that $k \ge K$. Moreover $f(k) = k \vee f_2(k) < 3f(\gamma)$
which shows that $\inf_{k \ge K} f(k) < 3f(\gamma) \le 6.7 \inf_x f(x)$ and justifies this choice of $k$.
Finally $\mathbb{E}_s[H^2(s,\hat{s})] \le C[\gamma \vee f_2(\gamma)]$. □
Note that our main assumption, namely that $\beta \in \ell_q^w$, implies that $\sum_{j > n} a_j^2 \le
R^2 n^{-2/q+1}$ by (4.10) while (4.12) entails that $\sum_{j > n} a_j^2 \le \sum_{j > n} \beta_j^2 \le A^2 n^{-\delta}$. Since
(4.12) is only an additional assumption it should not be strictly stronger than the main
one, which is the case if $A \le R$ and $\delta \ge 2/q - 1$. It is therefore natural to assume
that at least one of these inequalities does not hold.

Lemma 6. For positive parameters $a$, $B$ and $\delta$, we consider on $\mathbb{R}_+$ the function
$f(x) = B 2^{-\delta x} \vee x^a$. Let $V = a^{-1} \delta B^{1/a}$. If $V \le 2$ then $\inf_x f(x) = c_1 B$ with
$2^{-a} \le c_1 < 1$. If $V > 2$, then $\inf_x f(x) = [c_2\, a \delta^{-1} \log_2 V]^a$ with $0.469 < c_2 < 1$.

Proof. Clearly, the minimum is obtained when $x = x_0$ is the solution of $B 2^{-\delta x} = x^a$.
Setting $x_0 = B^{1/a} y$ and taking base 2 logarithms leads to $y^{-1} \log_2(y^{-1}) = V$, hence
$y < 1$. If $V \le 2$, then $1 < y^{-1} \le 2$ and the first result follows. If $V > 2$, the solution
takes the form $y = z V^{-1} \log_2 V$ with $1 > z > [1 - (\log_2 V)^{-1} \log_2(\log_2 V)] >
0.469$. □
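
The two regimes of Lemma 6 can be explored numerically by solving $B2^{-\delta x} = x^a$ by bisection. The code below is a sketch under stated assumptions: the function names are hypothetical, and the quantity `z` is the ratio $x_0\delta/(a\log_2 V)$ appearing in the proof.

```python
import math

def minimize_f(a: float, B: float, delta: float):
    """Return (x0, f(x0)), where x0 solves B 2^(-delta x) = x^a.

    f(x) = max(B 2^(-delta x), x^a) is the maximum of a decreasing and an
    increasing function, so its minimum is attained at their crossing point.
    """
    def g(x):
        return B * 2 ** (-delta * x) - x ** a
    lo, hi = 0.0, 1.0
    while g(hi) > 0:       # enlarge the bracket until the sign changes
        hi *= 2
    for _ in range(200):   # plain bisection
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    x0 = (lo + hi) / 2
    return x0, x0 ** a

def lemma6_ratio(a: float, B: float, delta: float) -> float:
    """Return z = x0 * delta / (a * log2 V) in the regime V > 2."""
    V = delta * B ** (1 / a) / a
    x0, _ = minimize_f(a, B, delta)
    return x0 * delta / (a * math.log2(V))

print(minimize_f(1.0, 1.0, 1.0))      # regime V <= 2
print(lemma6_ratio(0.5, 100.0, 0.5))  # regime V > 2
```

In the first regime the minimum is a fraction of $B$ between $2^{-a}$ and 1; in the second the ratio `z` stays between 0.469 and 1, as in the proof.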



4.4.3. Intensities with bounded variation on $[0,1)^2$

This section, which is devoted to the estimation of an intensity $s$ such that $\sqrt{s}$
belongs to the space $BV([0,1)^2)$, owes a lot to discussions with Albert Cohen and Ron
DeVore. The approximation results that we use here should be considered as theirs.
The definition and properties of the space $BV([0,1)^2)$ of functions with bounded
variation on $[0,1)^2$ are given in Cohen, DeVore, Petrushev and Xu [16] where the
reader can also find the missing details. It is known that, with the notations of
Section 4.1 for Besov spaces, $B^1_{1,1}([0,1)^2) \subset BV([0,1)^2) \subset B^1_{1,\infty}([0,1)^2)$. This
corresponds to the situation $\alpha = 1$, $l = 2$ and $p = 1$, therefore $\alpha = l(1/p - 1/2)$, a
borderline case which is not covered by the results of Theorem 4. On the other hand,
it is proved in Cohen, DeVore, Petrushev and Xu [16], Section 8, that, if a function
of $BV([0,1)^2)$ is expanded in the two-dimensional Haar basis, its coefficients belong
to the space $\ell_1^w$. More precisely if $f \in BV([0,1)^2)$ with semi-norm $|f|_{BV}$ and $f$ is
expanded in the Haar basis with coefficients $\beta_j$, then $|\beta|_{1,w} \le C|f|_{BV}$ where $|\beta|_{1,w}$
is given by (4.8) and $C$ is a universal constant. We may therefore use the results
of the previous section to estimate $\sqrt{s}$ but we need an additional assumption to
ensure that (4.12) is satisfied. By definition $\sqrt{s}$ belongs to $L_2([0,1)^2, dx)$ but we
shall assume here slightly more, namely that it belongs to $L_p([0,1)^2, dx)$ for some
$p > 2$. This is enough to show that (4.12) holds.

Lemma 7. If $f \in BV([0,1)^2) \cap L_p([0,1)^2, dx)$ for some $p > 2$ and has an expansion
$f = \sum_{j \ge -1} \sum_{k \in \Lambda(j)} \beta_{j,k} \varphi_{j,k}$ with respect to the Haar basis on $[0,1)^2$, then for
$J \ge -1$,

$\sum_{j > J} \sum_{k \in \Lambda(j)} |\beta_{j,k}|^2 \le C(p)\, \|f\|_p\, |f|_{B^1_{1,\infty}}\, 2^{-2J(1/2 - 1/p)}.$

Proof. It follows from Hölder inequality that $|\beta_{j,k}| = |\langle f, \varphi_{j,k} \rangle| \le \|f\|_p \|\varphi_{j,k}\|_{p'}$ with
$p'^{-1} = 1 - p^{-1}$ and by the structure of a wavelet basis, $\|\varphi_{j,k}\|_{p'} \le c_1 2^{-j(2/p'-1)}$ so that
$|\beta_{j,k}| \le c_2 \|f\|_p 2^{-j(2/p'-1)} = c_2 \|f\|_p 2^{-j(1-2/p)}$. Since $BV([0,1)^2) \subset B^1_{1,\infty}([0,1)^2)$,
it follows from (4.1) with $\alpha = p = 1$ and $l = 2$ that $\sum_{k \in \Lambda(j)} |\beta_{j,k}| \le |f|_{B^1_{1,\infty}}$ so that
$\sum_{k \in \Lambda(j)} |\beta_{j,k}|^2 \le c_2 \|f\|_p |f|_{B^1_{1,\infty}} 2^{-j(1-2/p)}$ for all $j \ge 0$. The conclusion follows. □

Since the number of coefficients $\beta_{j,k}$ with $j \le J$ is bounded by $C 2^{2J}$, after
a proper reindexing of the coefficients, the corresponding sequence $\beta$ will satisfy
(4.12) with $\delta = 1/2 - 1/p$ which shows that it is essential here that $p$ be larger than
2. We finally get the following corollary of Proposition 5 with $q = 1$.

Corollary 4. One can build a T-estimator $\hat{s}$ with the following properties. Let the
intensity $s$ be such that $\sqrt{s} \in BV([0,1)^2) \cap L_p([0,1)^2, dx)$ for some $p > 2$, so that
the expansion of $\sqrt{s}$ in the Haar basis satisfies (4.12) with $\delta = 1/2 - 1/p$ and $A \ge 1$.
Let $R = |\sqrt{s}|_{BV}$, then

$\mathbb{E}_s\big[H^2(s,\hat{s})\big] \le C \Big[ \sqrt{\gamma (R^2 \vee \gamma)} \wedge A^{2/(1+\delta)} \Big]$ with $\gamma = \delta^{-1} \left[ \dfrac{\log\big(\delta [A \vee R]^2\big)}{\log 2} \vee 1 \right].$


4.5. Mixing families of models

We have studied here a few families of approximating models. Many more can be
considered and further examples can be found in Reynaud-Bouret [30] or previous
papers of the author on model selection such as Barron, Birgé and Massart,
Birgé and Massart [13], Birgé, and Baraud and Birgé. As indicated in the
previous sections, the choice of suitable families of models is driven by results in
approximation theory relative to the type of intensity we expect to encounter or,
more precisely, to the type of assumptions we make about the unknown function
$\sqrt{s}$. Different types of assumptions will lead to different choices of approximating
models, but it is always possible to combine them. If we have built a few families of
linear models $\{S_m, m \in \mathcal{M}_j\}$ for $1 \le j \le J$ and chosen suitable weights $\Delta_m$ such
that $\sum_{m\in\mathcal{M}_j} \exp[-\Delta_m] \le \Sigma$ for all $j$, we may consider the mixed family of models
$\{S_m, m \in \mathcal{M}\}$ with $\mathcal{M} = \bigcup_{j=1}^{J} \mathcal{M}_j$ and define new weights $\Delta'_m = \Delta_m + \log J$
for all $m \in \mathcal{M}$, so that (1.13) still holds with the same value of $\Sigma$. It follows from
Theorem 1 that the T-estimator based on the mixed family will share the properties
of the ones derived from the initial families apart, possibly, from a moderate increase
in the risk of order $(\log J)^{q/2}$. The situation becomes more complex if $J$ is large
or even infinite. A detailed discussion of how to mix families of models in general
has been given in Birgé and Massart [12], Section 4.1, which applies with minor
modifications to our case.
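
The effect of the shift by $\log J$ on the weights can be checked numerically. The following is only a sketch: the two weight families `fam_a` and `fam_b` are arbitrary placeholders (they are not families from this paper), and the helper names are hypothetical.

```python
import math

def mix_weights(families):
    """Mix J families of weights, adding log(J) to every weight.

    `families` is a list of dicts {model_label: weight}; the labels of the
    mixed family are pairs (family_index, model_label).
    """
    J = len(families)
    mixed = {}
    for j, family in enumerate(families):
        for label, weight in family.items():
            mixed[(j, label)] = weight + math.log(J)
    return mixed

def weight_sum(weights):
    """Sum of exp(-weight) over a family, the quantity bounded by Sigma."""
    return sum(math.exp(-w) for w in weights.values())

# Two placeholder families indexed by a dimension d (assumed for illustration).
fam_a = {d: 2.0 * d for d in range(1, 30)}
fam_b = {d: d + math.log(2.0) for d in range(1, 30)}
families = [fam_a, fam_b]

bound = max(weight_sum(f) for f in families)   # a common bound for all j
mixed = mix_weights(families)
mixed_sum = weight_sum(mixed)                  # average of the J sums
```

Since each term is multiplied by $1/J$, the sum over the mixed family is the average of the $J$ individual sums and therefore cannot exceed their common bound.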



4.6. Asymptotics and a parallel with density estimation

The previous examples lead to somewhat unusual bounds with no number of
observations $n$ as in density estimation and no variance size $\sigma^2$ as in the case of
the estimation of a normal mean. Here, there is no rate of convergence because
there is no sequence of experiments, just one with a mean measure $\mu_s = s\cdot\lambda$.
To get back to more familiar results with rates and asymptotics and recover some
classical risk bounds, we may reformulate our problem in a slightly different form
which completely parallels the one we use for density estimation. As indicated in
our introduction we may always rewrite the intensity $s$ as $s = ns_1$ with $\int s_1\,d\lambda = 1$
so that $s_1$ becomes a density and $n = \mu_s(\mathcal{X})$. We use this notation here, although
$n$ need not be an integer, to emphasize the similarity between the estimation of $s$
and density estimation. When $n$ is an integer this also corresponds to observing $n$
i.i.d. Poisson processes $X_i$, $1 \le i \le n$, with intensity $s_1$ and setting $\Lambda_X = \sum_{i=1}^{n} \Lambda_{X_i}$.
In this case (1.15) can be rewritten in the following way.

Corollary 5. Let $\lambda$ be some positive measure on $\mathcal{X}$, $X$ be a Poisson process with
unknown intensity $s \in L_1^+(\lambda)$, $\{S_m, m \in \mathcal{M}\}$ be a finite or countable family of
linear subspaces of $L_2(\lambda)$ with respective finite dimensions $D_m$ and let $\{\Delta_m\}_{m\in\mathcal{M}}$
be a family of nonnegative weights satisfying (1.13). One can build a T-estimator
$\hat s(X)$ of $s$ satisfying, for all $s \in L_1^+(\lambda)$ such that $\int s\,d\lambda = n$, $s_1 = n^{-1}s$ and all



E. s 



-1/2 



H(s,s)) <C(g)[l + E] inf { inf \\y/^-t 





Written in this form, our result appears as a complete analogue of Theorem 6
of Birgé about density estimation, the normalized loss function $(H/\sqrt{n})^q$ playing
the role of the Hellinger loss $h^q$ for densities. We also explained in Birgé,
Section 8.3.3, that there is a complete parallel between density estimation and
estimation in the white noise model. We can therefore extend this parallel to the
estimation of the intensity of a Poisson process. This parallel has also been
explained and applied to various examples in Baraud and Birgé, Section 4.2. As
an additional consequence, all the families of models that we have introduced in
Sections 3.3, 4.2, 4.3 and 4.4 could be used as well for adaptive estimation of densities
or in the white noise model and added to the examples given in Birgé.

To recover the familiar rates of convergence that we get when estimating densities
which belong to some given function class $S$, we merely have to assume that $s_1$
(rather than $s$) belongs to the class $S$ and use the normalized loss function. Let us,
for instance, apply this approach to intensities belonging to Besov spaces, assuming
that $\sqrt{s_1} \in B^\alpha_{p,\infty}([0,1]^l)$ with $\alpha > l(1/p - 1/2)_+$ and that $|\sqrt{s_1}|_{B^\alpha_{p,\infty}} \le L$ with
$L > 0$. It follows that $\sqrt{s} \in B^\alpha_{p,\infty}([0,1]^l)$ with $|\sqrt{s}|_{B^\alpha_{p,\infty}} \le L\sqrt{n}$. For $n$ large enough,
$L\sqrt{n} \ge 1$ and the corresponding risk bound for Besov spaces applies, leading to
$\mathbb{E}_s[H^2(s,\hat s)] \le C(\alpha,p,l)(L\sqrt{n})^{2l/(2\alpha+l)}$. Hence

$$\mathbb{E}_s\left[n^{-1}H^2(s,\hat s)\right] \le C(\alpha,p,l)\,L^{2l/(2\alpha+l)}\,n^{-2\alpha/(2\alpha+l)},$$

which is exactly the result we get for density estimation with $n$ i.i.d. observations.
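
The passage from the bound in $(L\sqrt{n})$ to the normalized rate is plain exponent arithmetic, which can be verified exactly. A minimal sketch, with a hypothetical helper name, computing the exponents of $L$ and $n$ in $n^{-1}(L\sqrt{n})^{2l/(2\alpha+l)}$:

```python
from fractions import Fraction

def normalized_rate_exponents(alpha, l):
    """Exponents of L and n in n^{-1} * (L * sqrt(n))^{2l / (2*alpha + l)}."""
    alpha, l = Fraction(alpha), Fraction(l)
    e = 2 * l / (2 * alpha + l)   # exponent carried by (L * sqrt(n))
    exp_L = e                     # L appears with exponent e
    exp_n = e / 2 - 1             # sqrt(n)^e contributes e/2, n^{-1} contributes -1
    return exp_L, exp_n

# Example: smoothness alpha = 2 in dimension l = 1.
exp_L, exp_n = normalized_rate_exponents(2, 1)
```

For $\alpha = 2$ and $l = 1$ this gives $L^{2/5}n^{-4/5}$, in agreement with the general exponent $-2\alpha/(2\alpha+l)$.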

The same argument can be developed for the problem we considered in
Section 3.2. If we assume that $\sqrt{s_1}$, rather than $\sqrt{s}$, belongs to $\mathcal{H}(\alpha,R)$, then
$\sqrt{s} \in \mathcal{H}(\alpha,\sqrt{n}R)$ and the condition $R_j \ge \eta$ of Corollary 2 becomes, after this rescaling,
$\sqrt{n}R_j \ge (\sqrt{n}R)^{k/(2\alpha+k)}$, which always holds for $n$ large enough. The corresponding
normalized risk bound can then be written

$$\mathbb{E}_s\left[n^{-1}H^2(s,\hat s)\right] \le C(k,\alpha)\,R^{2k/(2\alpha+k)}\,n^{-2\alpha/(2\alpha+k)},$$

which corresponds to the rate of convergence for this problem in density estimation.

Another interesting case is the one considered in Section 4.4. Let us assume here
that instead of putting the assumptions of Proposition 5 on $\sqrt{s}$ we put them on
$\sqrt{s_1}$. This implies that $\sqrt{s}$ satisfies the same assumptions with $R$ replaced by $R\sqrt{n}$
and $A$ by $A\sqrt{n}$. Then, for $n \ge n_0(A,R,\delta)$, $\gamma \le 2\delta^{-1}\log n \le nR^2$ and

$$\mathbb{E}_s\left[n^{-1}H^2(s,\hat s)\right] \le C(q,\delta,A,R)\left(n^{-1}\log n\right)^{1-q/2}.$$

This result is comparable to the bounds obtained in Corollary 3.1 of Kerkyacharian
and Picard [23] but here we do not know the relationship between $q$ and $\delta$. For
the special situation of $\sqrt{s_1} \in BV([0,1)^2)$, we get
$\mathbb{E}_s[n^{-1}H^2(s,\hat s)] \le C(q,\delta,s_1)(n^{-1}\log n)^{1/2}$. One could also translate all other risk bounds in the same way.

An alternative asymptotic approach, which has been considered in Reynaud-Bouret
[30], is to assume that $X$ is a Poisson process on $\mathbb{R}^k$ with intensity $s$ with
respect to the Lebesgue measure on $\mathbb{R}^k$, but which is only observed on $[0,T]^k$.
We therefore estimate $s\mathbb{1}_{[0,T]^k}$, letting $T$ go to infinity to get an asymptotic result.
We only assume that $\int_{[0,T]^k} s(x)\,dx$ is finite for all $T > 0$, not necessarily that
$\int_{\mathbb{R}^k} s(x)\,dx < +\infty$. For simplicity, let us consider the case of intensities $s$ on $\mathbb{R}_+$
with $\sqrt{s}$ belonging to the Hölder class $\mathcal{H}(\alpha,R)$. For $t$ an intensity on $\mathbb{R}_+$, we set
for $0 \le x \le 1$, $t_T(x) = Tt(Tx)$ so that $t_T$ is an intensity on $[0,1]$ and $H(t_T,u_T) =
H(t\mathbb{1}_{[0,T]},u\mathbb{1}_{[0,T]})$. Since $\sqrt{s_T} \in \mathcal{H}(\alpha,RT^{\alpha+1/2})$ it follows from Corollary 2 that
there is a T-estimator $\hat s_T(X)$ of $s_T$ satisfying

$$\mathbb{E}_s\left[H^2(s_T,\hat s_T)\right] \le C(\alpha)\left(RT^{\alpha+1/2}\right)^{2/(2\alpha+1)} = C(\alpha)\,T\,R^{2/(2\alpha+1)}.$$

Finally setting $\hat s(y) = T^{-1}\hat s_T(T^{-1}y)$ for $y \in [0,T]$, we get an estimator $\hat s(X)$ of
$s\mathbb{1}_{[0,T]}$ depending on $T$ with the property that

$$\mathbb{E}_s\left[H^2(s\mathbb{1}_{[0,T]},\hat s)\right] \le C(\alpha)\,T\,R^{2/(2\alpha+1)} \quad\text{for all } T > 0.$$
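
The invariance of the Hellinger-type distance under the rescaling $t \mapsto t_T$ is a change of variables, and it can be observed numerically. The sketch below uses a midpoint rule and two arbitrary positive functions `t` and `u` chosen only for illustration; it takes $H^2$ to be half the squared $L_2$-distance between square roots of intensities.

```python
import math

def hellinger_sq(t, u, a, b, n=20000):
    """Midpoint-rule approximation of (1/2) * int_a^b (sqrt(t) - sqrt(u))^2 dx."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        total += (math.sqrt(t(x)) - math.sqrt(u(x))) ** 2
    return 0.5 * h * total

def rescale(t, T):
    """Map an intensity t on [0, T] to the intensity x -> T * t(T * x) on [0, 1]."""
    return lambda x: T * t(T * x)

T = 7.0
t = lambda y: 1.0 + math.sin(y) ** 2      # placeholder intensity
u = lambda y: math.exp(-y / 5.0) + 0.5    # placeholder intensity

lhs = hellinger_sq(rescale(t, T), rescale(u, T), 0.0, 1.0)  # distance on [0, 1]
rhs = hellinger_sq(t, u, 0.0, T)                            # distance on [0, T]
```

With the same number of midpoints the two sums agree term by term up to rounding, which mirrors the substitution $y = Tx$ in the integral.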



4.7. An illustration with Poisson regression



As we mentioned in the introduction, a particular case occurs when $\mathcal{X}$ is a finite
set that we shall assume here, for simplicity, to be $\{1; \ldots; 2^n\}$. In this situation,
observing $X$ amounts to observing $N = 2^n$ independent Poisson variables with
respective parameters $s_i = s(i)$ where $s$ denotes the intensity with respect to the
counting measure. If we introduce a family of linear models $S_m$ in $\mathbb{R}^N$ to approximate
$\sqrt{s} \in \mathbb{R}^N$ with respect to the Euclidean distance, we simply apply Theorem 1
to get the resulting risk bounds. In this situation, the Hellinger distance between
two intensities is merely the Euclidean distance between their square roots, up to
a factor $1/\sqrt{2}$.
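
For intensities with respect to the counting measure on a finite set this identity is easy to check by hand. A small sketch, with hypothetical function names, compares the distance computed from total masses and the affinity $\sum_i\sqrt{s_it_i}$ with the Euclidean distance between square roots:

```python
import math

def hellinger(s, t):
    """H computed from H^2 = (1/2) * (sum(s) + sum(t)) - sum(sqrt(s_i * t_i))."""
    h2 = 0.5 * (sum(s) + sum(t)) - sum(math.sqrt(a * b) for a, b in zip(s, t))
    return math.sqrt(max(h2, 0.0))  # guard against tiny negative rounding

def euclidean_between_roots(s, t):
    """Euclidean distance between the vectors sqrt(s) and sqrt(t)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(s, t)))

s = [4.0, 1.0, 9.0, 0.0]
t = [1.0, 1.0, 4.0, 16.0]
H = hellinger(s, t)                 # equals 3 for these vectors
E = euclidean_between_roots(s, t)   # equals sqrt(18)
```

Here the square roots differ by $(1, 0, 1, -4)$, so the squared Euclidean distance is 18 and $H^2 = 9$.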

As an example, we shall consider linear models spanned by piecewise constant
functions on $\mathcal{X}$ as described in Section 1.4, i.e. $S_m = \{\sum_{j=1}^{D} a_j\mathbb{1}_{I_j}\}$ when $m =
\{I_1, \ldots, I_D\}$ is a partition of $\mathcal{X}$ into $D = |m|$ nonvoid intervals. In order to define
suitable weights $\Delta_m$, we shall distinguish between two types of partitions. First
we consider the family $\mathcal{M}_{BT}$ of dyadic partitions derived from binary trees and
described in Section 4.3. We already know that the choice $\Delta_m = 2|m|$ is suitable for
those partitions and (4.4) applies. Note that these include the regular partitions,
i.e. those for which all intervals $I_j$ have the same size $N/|m|$ and $|m| = 2^k$ for
$0 \le k \le n$. For all other partitions, we simply set $\Delta_m = \log\binom{N-1}{|m|-1} + 2\log(|m|)$ so
that (1.13) holds with $\Sigma < 3$ since the number of possible partitions of $\mathcal{X}$ into $|m|$
intervals is $\binom{N-1}{|m|-1}$. We omit the details. Denoting by $\|\cdot\|_2$ the Euclidean norm in
$\mathbb{R}^N$, we derive from Theorem 1 the following risk bound for T-estimators:



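$$\mathbb{E}_s\left[\left\|\sqrt{s}-\sqrt{\hat s}\right\|_2^2\right] \le C\left[\inf_{m\in\mathcal{M}_{BT}}\left\{\inf_{t\in S_m}\left\|\sqrt{s}-t\right\|_2^2+|m|\right\} \wedge \inf_{m\in\mathcal{M}\setminus\mathcal{M}_{BT}}\left\{\inf_{t\in S_m}\left\|\sqrt{s}-t\right\|_2^2+\log(|m|)+\log\binom{N-1}{|m|-1}\right\}\right].$$
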
The performance of the estimator then depends on the approximation properties
of the linear spaces $S_m$ with respect to $\sqrt{s}$. For instance, if $\sqrt{s}$ varies regularly, i.e.
$|\sqrt{s_i} - \sqrt{s_{i-1}}| \le R$ for all $i$, one uses a regular partition which belongs to $\mathcal{M}_{BT}$ to
approximate $\sqrt{s}$. If $\sqrt{s}$ has bounded $\alpha$-variation, as defined in Section 4.3, one uses
dyadic partitions as explained in that section. If $\sqrt{s}$ is piecewise constant with $k$
jumps, it belongs to some $S_m$ and we get a risk bound of order $\log(k+1) + \log\binom{N-1}{k}$.
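
The counting argument behind the weights for general partitions can be checked by brute force for a small $N$. The sketch below enumerates all partitions of $\{1,\ldots,N\}$ into $D$ consecutive intervals and evaluates $\sum_m \exp[-\Delta_m]$ over all of them; this is an upper bound for the sum over the non-dyadic partitions only. The value $N = 8$ and the helper name are assumptions made for illustration.

```python
import math
from itertools import combinations

def interval_partitions(N, D):
    """All partitions of {1, ..., N} into D nonvoid consecutive intervals.

    Each partition is a list of (first, last) pairs; it is determined by the
    D - 1 cut positions chosen among the N - 1 gaps between neighbours.
    """
    partitions = []
    for cuts in combinations(range(1, N), D - 1):
        bounds = (0,) + cuts + (N,)
        partitions.append([(bounds[i] + 1, bounds[i + 1]) for i in range(D)])
    return partitions

N = 8
counts = {D: len(interval_partitions(N, D)) for D in range(1, N + 1)}

# Sum of exp(-Delta_m) with Delta_m = log C(N-1, |m|-1) + 2 log |m|.
weight_sum = sum(
    counts[D] * math.exp(-(math.log(math.comb(N - 1, D - 1)) + 2.0 * math.log(D)))
    for D in range(1, N + 1)
)
```

The binomial factor cancels the number of partitions of each size, leaving $\sum_{D} D^{-2}$, which stays below $\pi^2/6$ whatever $N$.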

5. Aggregation of estimators 

In this section we assume that we have at our disposal a family $\{\hat s_m, m \in \mathcal{M}\}$
of intensity estimators (T-estimators or others) and that we want to select one
of them or combine them in some way in order to get an improved estimator. We
already explained in Section 2.3 how to use the procedure of thinning to derive from
a Poisson process $X$ with mean measure $\mu$ two independent Poisson processes with
mean measure $\mu/2$. Since estimating $\mu/2$ is equivalent to estimating $\mu$, we shall
assume in this section that we have at our disposal two independent processes $X_1$
and $X_2$ with the same unknown mean measure $\mu_s$ with intensity $s$ to be estimated.
We assume that the initial estimators $\hat s_m(X_1)$ are all based on the first process and
therefore independent of $X_2$. Proceeding conditionally on the first process, we use
the second one to mix the estimators.
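
The thinning step itself is elementary: each observed point is sent independently to one of two sub-processes with probability 1/2. A minimal sketch follows; the point configuration is a placeholder standing for one realization of the process, and the function name is hypothetical.

```python
import random

def thin(points, rng):
    """Split a point configuration into two by independent fair coin flips."""
    first, second = [], []
    for p in points:
        if rng.random() < 0.5:
            first.append(p)
        else:
            second.append(p)
    return first, second

rng = random.Random(0)                                  # fixed seed for reproducibility
points = [rng.uniform(0.0, 1.0) for _ in range(1000)]   # placeholder realization
x1, x2 = thin(points, rng)
```

Every point ends up in exactly one of `x1` and `x2`; when `points` is a realization of a Poisson process with mean measure $\mu$, the two parts play the roles of $X_1$ and $X_2$.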

We shall consider here two different ways of aggregating estimators. The first 
one is suitable when we want to choose one estimator in a large (possibly infinite) 
family of estimators and possibly attach to them different prior weights. The second 
method tries to find the best linear combination from a finite family of estimators
of $\sqrt{s}$.

5.1. Estimator selection 

Here we start from a finite or countable family $\{\hat s_m, m \in \mathcal{M}\}$ of intensity estimators
and a family of weights $\Delta_m \ge 1/10$ satisfying (1.13). Our purpose is to use the
process $X_2$ to find a close to best estimator among the family $\{\hat s_m(X_1), m \in \mathcal{M}\}$.

5.1.1. A general result 

Considering each estimator $\hat s_m(X_1)$ as a model $S_m = \{\hat s_m(X_1)\}$ with one single
point, we set $\eta_m^2 = 84\Delta_m$. Then $S_m$ is a T-model with parameters $\eta_m$, $1/2$ and
B' = e- 2 , ([321) and (O hold and Theorem [3] applies. Since each model is reduced 
to one point, one can find a selection procedure $\hat m(X_2)$ such that the estimator
$\tilde s(X_1,X_2) = \hat s_{\hat m(X_2)}(X_1)$ satisfies the risk bound

$$\mathbb{E}_s\left[H^2(s,\tilde s)\,\middle|\,X_1\right] \le C[1+\Sigma]\inf_{m\in\mathcal{M}}\left\{H^2(s,\hat s_m(X_1)) + \Delta_m\right\}.$$

Integrating with respect to the process $X_1$ gives

(5.1)   $\mathbb{E}_s\left[H^2(s,\tilde s)\right] \le C[1+\Sigma]\inf_{m\in\mathcal{M}}\left\{\mathbb{E}_s\left[H^2(s,\hat s_m)\right] + \Delta_m\right\}.$

This result completely parallels the one obtained for density estimation in
Section 9.1.2 of Birgé.






5.1.2. Application to histograms 

The simplest estimators for the intensity $s$ of a Poisson process $X$ are histograms.
Let $m$ be a finite partition $m = \{I_1, \ldots, I_D\}$ of $\mathcal{X}$ such that $\lambda(I_j) > 0$ for all $j$. To
this partition corresponds the linear space of piecewise constant functions on the
partition $m$: $S_m = \{\sum_{j=1}^{D} a_j\mathbb{1}_{I_j}\}$, the projection $\bar s_m$ of $s$ onto $S_m$ and the
corresponding histogram estimator $\hat s_m$ of $s$ given respectively by

$$\bar s_m = \sum_{j=1}^{D}\left(\int_{I_j} s\,d\lambda\right)[\lambda(I_j)]^{-1}\mathbb{1}_{I_j} \quad\text{and}\quad \hat s_m = \sum_{j=1}^{D} N_j[\lambda(I_j)]^{-1}\mathbb{1}_{I_j} \quad\text{with}\quad N_j = \sum_{i=1}^{N}\mathbb{1}_{I_j}(X_i).$$

It is proved in Baraud and Birgé, Lemma 2, that $H^2(s,\bar s_m) \le 2H^2(s,S_m)$. Moreover, one
can show an analogue of the risk bound obtained for the case of density estimation
in Birgé and Rozenholc, Theorem 1. The proof is identical, replacing $h$ by $H$,
$n$ by 1 and the binomial distribution of $N_j$ by a Poisson distribution. This leads to
the risk bound

$$\mathbb{E}_s\left[H^2(s,\hat s_m)\right] \le H^2(s,\bar s_m) + D/2 \le 2H^2(s,S_m) + |m|/2.$$
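
As an illustration, here is a sketch of the histogram estimator on an interval of the real line, taking Lebesgue measure as the dominating measure so that the measure of each cell is its length. The partition, the sample points and the function name are all assumptions made for the example.

```python
from bisect import bisect_right

def histogram_intensity(points, edges):
    """Histogram estimate of an intensity on [edges[0], edges[-1]).

    `edges` are increasing breakpoints e_0 < ... < e_D defining the cells
    [e_{j-1}, e_j); the returned list holds the constant value N_j / length
    on each cell, where N_j is the number of points falling in the cell.
    """
    D = len(edges) - 1
    counts = [0] * D
    for x in points:
        j = bisect_right(edges, x) - 1   # index of the cell containing x
        if 0 <= j < D:                   # ignore points outside the partition
            counts[j] += 1
    return [counts[j] / (edges[j + 1] - edges[j]) for j in range(D)]

points = [0.05, 0.10, 0.30, 0.55, 0.60, 0.62, 0.90]
edges = [0.0, 0.25, 0.5, 1.0]
heights = histogram_intensity(points, edges)   # [8.0, 4.0, 8.0]
```

The estimate integrates to the total number of observed points, which is the natural counterpart of a density histogram integrating to one.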

If we are given an arbitrary family $\mathcal{M}$ of partitions of $\mathcal{X}$ and a corresponding
family of weights $\{\Delta_m, m \in \mathcal{M}\}$ satisfying (1.13) and $\Delta_m \ge |m|/2$, we may apply
the previous aggregation method which will result in an estimator $\tilde s(X_1,X_2) =
\hat s_{\hat m(X_2)}(X_1)$ where $\hat m(X_2)$ is a data-selected partition. Finally,

(5.2)   $\mathbb{E}_s\left[H^2(s,\tilde s)\right] \le C[1+\Sigma]\inf_{m\in\mathcal{M}}\left\{H^2(s,S_m) + \Delta_m\right\}.$

Various choices of partitions and weights have been described in Baraud and Birgé
together with their approximation properties with respect to different classes
of functions. Numerous illustrations of applications of (5.2) can therefore be found
there.



5.2. Linear aggregation 

Here we start with a finite family $\{\hat s_i(X_1), 1 \le i \le n\}$ of intensity estimators. We
choose for $\mathcal{M}$ the set of all nonvoid subsets of $\{1, \ldots, n\}$ and to each such subset
$m$, we associate the $|m|$-dimensional linear subspace $S_m$ of $L_2(\lambda)$ given by

(5.3)   $S_m = \left\{\sum_{j\in m}\lambda_j\sqrt{\hat s_j(X_1)} \ \text{with}\ \lambda_j \in \mathbb{R}\ \text{for}\ j \in m\right\}.$

We then set $\Delta_m = \log\binom{n}{|m|} + 2\log(|m|)$ so that (1.13) holds with $\Sigma = \sum_{i=1}^{n} i^{-2}$.
We may therefore apply Theorem 1 to the process $X_2$ and this family of models
conditionally to $X_1$, which results in the bound



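$$\mathbb{E}_s\left[H^2(s,\tilde s)\,\middle|\,X_1\right] \le C[1+\Sigma] \times \inf_{m\in\mathcal{M}}\left\{\inf_{t\in S_m}\left\|\sqrt{s}-t(X_1)\right\|_2^2 + \log\binom{n}{|m|} + \log(|m|)\right\}.$$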



Note that the restriction of this bound to subsets $m$ such that $|m| = 1$ corresponds
to a variant of estimator selection and leads, after integration, to

$$\mathbb{E}_s\left[H^2(s,\tilde s)\right] \le C[1+\Sigma]\left\{\inf_{1\le i\le n}\,\inf_{\lambda\in\mathbb{R}}\mathbb{E}_s\left[\left\|\sqrt{s}-\lambda\sqrt{\hat s_i(X_1)}\right\|_2^2\right] + \log n\right\}.$$

This can be viewed as an improved version of (5.1) when we choose equal weights.

6. Testing balls in $(\mathcal{Q}_+(\mathcal{X}), H)$

6.1. The construction of robust tests 

In order to use Theorem 3, we have to find tests $\psi_{t,u}$ satisfying the conclusions
of Proposition 1. These tests are provided by a straightforward corollary of the
following theorem.

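Theorem 5. Given two elements $\pi_c$ and $\nu_c$ of $\mathcal{Q}_+(\mathcal{X})$ with respective densities
$d\pi_c$ and $d\nu_c$ with respect to some dominating measure $\lambda \in \mathcal{Q}_+(\mathcal{X})$ and a number
$\xi \in (0,1/2)$, let us define $\pi_m$ and $\nu_m$ in $\mathcal{Q}_+(\mathcal{X})$ by their densities $d\pi_m$ and $d\nu_m$
with respect to $\lambda$ in the following way:

$$\sqrt{d\pi_m} = \xi\sqrt{d\nu_c} + (1-\xi)\sqrt{d\pi_c} \quad\text{and}\quad \sqrt{d\nu_m} = \xi\sqrt{d\pi_c} + (1-\xi)\sqrt{d\nu_c}.$$

Then for all $x \in \mathbb{R}$, $\mu \in \mathcal{Q}_+(\mathcal{X})$ and $X$ a Poisson process with mean measure $\mu$,

$$\mathbb{P}_\mu\left[\log\left(\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)\right) \ge 2x\right] \le \exp\left[-x + (1-2\xi)\left(\frac{2}{\xi}H^2(\mu,\nu_c) - H^2(\pi_c,\nu_c)\right)\right]$$

and

$$\mathbb{P}_\mu\left[\log\left(\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)\right) \le 2x\right] \le \exp\left[x + (1-2\xi)\left(\frac{2}{\xi}H^2(\mu,\pi_c) - H^2(\pi_c,\nu_c)\right)\right].$$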

Corollary 6. Let $\pi_c$ and $\nu_c$ be two elements of $\mathcal{Q}_+(\mathcal{X})$, $0 < \xi < 1/2$ and

$$T(X) = \log\left(\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)\right) - 2x,$$

with $\pi_m$ and $\nu_m$ given by Theorem 5. Define a test function $\psi$ with values in $\{\pi_c, \nu_c\}$
by $\psi(X) = \pi_c$ when $T(X) > 0$, $\psi(X) = \nu_c$ when $T(X) < 0$ ($\psi(X)$ being arbitrary
if $T(X) = 0$). If $X$ is a Poisson process with mean measure $\mu$, then

$$\mathbb{P}_\mu[\psi(X) = \pi_c] \le \exp\left[-x - (1-2\xi)^2H^2(\pi_c,\nu_c)\right] \quad\text{if } H(\mu,\nu_c) \le \xi H(\pi_c,\nu_c)$$

and

$$\mathbb{P}_\mu[\psi(X) = \nu_c] \le \exp\left[x - (1-2\xi)^2H^2(\pi_c,\nu_c)\right] \quad\text{if } H(\mu,\pi_c) \le \xi H(\pi_c,\nu_c).$$

To derive Proposition 1 we simply set $\pi_c = \mu_t$, $\nu_c = \mu_u$, $\xi = 1/4$, $x = [\eta^2(t) -
\eta^2(u)]/4$ and define $\psi_{t,u} = \psi$ in Corollary 6. As to (3.5), it follows from the second
bound of Theorem 5.
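
To make the test concrete, consider the finite setting of the Poisson regression illustration, where a mean measure is a vector of positive intensities and the observation is a vector of independent Poisson counts. The sketch below is only an illustration under that assumption: the function names, the example vectors and the tie-breaking rule are choices made here, and the log-likelihood ratio used is the standard one for independent Poisson counts.

```python
import math

def mixed_measures(pi_c, nu_c, xi):
    """Square-root mixtures: sqrt(pi_m) = xi*sqrt(nu_c) + (1-xi)*sqrt(pi_c), and symmetrically."""
    pi_m = [(xi * math.sqrt(b) + (1 - xi) * math.sqrt(a)) ** 2 for a, b in zip(pi_c, nu_c)]
    nu_m = [(xi * math.sqrt(a) + (1 - xi) * math.sqrt(b)) ** 2 for a, b in zip(pi_c, nu_c)]
    return pi_m, nu_m

def log_likelihood_ratio(counts, pi, nu):
    """log(dQ_pi / dQ_nu) for independent Poisson counts with means pi and nu."""
    return sum(b - a + k * math.log(a / b) for k, a, b in zip(counts, pi, nu))

def robust_test(counts, pi_c, nu_c, xi=0.25, x=0.0):
    """Return 'pi_c' when the statistic T is positive and 'nu_c' otherwise."""
    pi_m, nu_m = mixed_measures(pi_c, nu_c, xi)
    statistic = log_likelihood_ratio(counts, pi_m, nu_m) - 2.0 * x
    return "pi_c" if statistic > 0 else "nu_c"   # ties go to nu_c (arbitrary choice)

pi_c = [4.0, 1.0, 1.0]
nu_c = [1.0, 1.0, 4.0]
decision_a = robust_test([5, 1, 0], pi_c, nu_c)   # counts concentrated where pi_c is large
decision_b = robust_test([0, 1, 5], pi_c, nu_c)   # counts concentrated where nu_c is large
```

Because $\xi > 0$, the mixed intensities are positive wherever either $\pi_c$ or $\nu_c$ is, so the likelihood ratio between $\pi_m$ and $\nu_m$ stays bounded; this is the property exploited in the proof through Lemma 9.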



6.2. Proof of Theorem 5



It is based on the following technical lemmas. 

Lemma 8. Let $f, g, f' \in L_2^+(\lambda)$ and $\|g/f\|_\infty \le K$. Denoting by $\langle\cdot,\cdot\rangle$ and $\|\cdot\|_2$
the scalar product and norm in $L_2(\lambda)$, we get

(6.1)   $\displaystyle\int g f^{-1} f'^2\,d\lambda \le K\|f - f'\|_2^2 + 2\langle g, f'\rangle - \langle g, f\rangle.$

Proof. Denoting by $Q$ the left-hand side of (6.1) we write

$$Q = \int g f^{-1}(f' - f)^2\,d\lambda + 2\int g f'\,d\lambda - \int g f\,d\lambda,$$

hence the result. □



Lemma 9. Let $\mu$, $\pi$ and $\nu$ be three mean measures with $\pi \ll \nu$ and $\|d\pi/d\nu\|_\infty \le K^2$
and let $X$ be a Poisson process with mean measure $\mu$. Then

$$\mathbb{E}_\mu\left[\sqrt{\frac{dQ_\pi}{dQ_\nu}(X)}\right] \le \exp\left[2KH^2(\mu,\nu) - 2H^2(\pi,\mu) + H^2(\pi,\nu)\right].$$



Proof. By and (fL2"|) . 



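$$\mathbb{E}_\mu\left[\sqrt{\frac{dQ_\pi}{dQ_\nu}(X)}\right] = \exp\left[\frac{\nu(\mathcal{X})-\pi(\mathcal{X})}{2}\right]\mathbb{E}_\mu\left[\prod_{i=1}^{N}\sqrt{\frac{d\pi}{d\nu}(X_i)}\right]$$
$$= \exp\left[\frac{\nu(\mathcal{X})-\pi(\mathcal{X})}{2} + \int_{\mathcal{X}}\left(\sqrt{\frac{d\pi}{d\nu}(x)} - 1\right)d\mu(x)\right]$$
$$= \exp\left[\frac{\nu(\mathcal{X})-\pi(\mathcal{X})}{2} - \mu(\mathcal{X}) + \int_{\mathcal{X}}\sqrt{\frac{d\pi}{d\nu}(x)}\,d\mu(x)\right].$$

Using Lemma 8 and (1.7), we derive that

$$\int_{\mathcal{X}}\sqrt{\frac{d\pi}{d\nu}(x)}\,d\mu(x) \le 2KH^2(\mu,\nu) + 2\int\sqrt{d\pi\,d\mu} - \int\sqrt{d\pi\,d\nu}$$
$$= 2KH^2(\mu,\nu) - 2H^2(\pi,\mu) + \pi(\mathcal{X}) + \mu(\mathcal{X}) + H^2(\pi,\nu) - (1/2)[\pi(\mathcal{X}) + \nu(\mathcal{X})].$$

The conclusion follows.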



□ 
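
On a finite space the left-hand side of Lemma 9 has a closed form, since for a Poisson variable $Y$ with mean $m$ one has $\mathbb{E}[r^Y] = \exp[m(r-1)]$. The sketch below, with hypothetical names and arbitrary example vectors, compares the logarithms of both sides for vectors of positive intensities.

```python
import math

def h2(a, b):
    """Squared Hellinger-type distance between two intensity vectors."""
    return 0.5 * sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b))

def log_expected_sqrt_lr(mu, pi, nu):
    """log E_mu[ sqrt(dQ_pi / dQ_nu)(X) ] for independent Poisson counts."""
    return sum((n - p) / 2.0 + m * (math.sqrt(p / n) - 1.0) for m, p, n in zip(mu, pi, nu))

def log_bound(mu, pi, nu):
    """Exponent 2K H^2(mu,nu) - 2 H^2(pi,mu) + H^2(pi,nu) with K^2 = max(pi/nu)."""
    K = max(math.sqrt(p / n) for p, n in zip(pi, nu))
    return 2.0 * K * h2(mu, nu) - 2.0 * h2(pi, mu) + h2(pi, nu)

mu = [2.0, 0.5, 3.0]
pi = [1.0, 2.0, 0.5]
nu = [0.5, 1.0, 4.0]
lhs = log_expected_sqrt_lr(mu, pi, nu)
rhs = log_bound(mu, pi, nu)
```

The gap `rhs - lhs` equals $\sum_i (K - \sqrt{\pi_i/\nu_i})(\sqrt{\mu_i} - \sqrt{\nu_i})^2$, which is exactly the nonnegative term discarded when Lemma 8 is applied.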



To prove Theorem 5, we may assume (changing $\lambda$ if necessary) that $\mu \ll \lambda$ and
set $v = \sqrt{d\mu/d\lambda}$. We also set $t_c = \sqrt{d\pi_c/d\lambda}$, $u_c = \sqrt{d\nu_c/d\lambda}$, $t_m = \xi u_c + (1-\xi)t_c$ and
$u_m = \xi t_c + (1-\xi)u_c$. Then $\pi_m = t_m^2\cdot\lambda$ and $\nu_m = u_m^2\cdot\lambda$. Note that $t_c$, $u_c$, $t_m$, $u_m$
and $v$ belong to $L_2^+(\lambda)$ and that for two elements $w, z$ in $L_2^+(\lambda)$, $\|w - z\|_2^2 =
2H^2(w^2\cdot\lambda, z^2\cdot\lambda)$. Since $\|t_m/u_m\|_\infty \le (1-\xi)/\xi$ we may apply Lemma 9 with
$K = (1-\xi)/\xi$ to derive that

$$L = \log\mathbb{E}_\mu\left[\sqrt{\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)}\right] \le \frac{1-\xi}{\xi}\|v - u_m\|_2^2 - \|v - t_m\|_2^2 + \frac{\|t_m - u_m\|_2^2}{2}.$$

Using the fact that

$$v - u_m = v - u_c + \xi(u_c - t_c), \quad v - t_m = v - u_c + (1-\xi)(u_c - t_c), \quad t_m - u_m = (1-2\xi)(t_c - u_c)$$

and expanding the squared norms, we get, since the scalar products cancel,

$$L \le \frac{1-2\xi}{\xi}\|v - u_c\|_2^2 + \left[\xi(1-\xi) - (1-\xi)^2 + \frac{(1-2\xi)^2}{2}\right]\|t_c - u_c\|_2^2,$$

which shows that

$$L \le (1-2\xi)\left[\frac{2}{\xi}H^2(\mu,\nu_c) - H^2(\pi_c,\nu_c)\right].$$

The exponential inequality then implies that

$$\mathbb{P}_\mu\left[\log\left(\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)\right) \ge 2x\right] \le \exp[-x + L],$$

which proves the first error bound. The second one can be proved in the same way.
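
The algebra in this expansion can be verified with exact rational arithmetic. In the sketch below (hypothetical helper name), the three returned numbers are the coefficients of $\|v-u_c\|^2$, of the scalar product $\langle v-u_c, u_c-t_c\rangle$ and of $\|t_c-u_c\|^2$ obtained when the three squared norms are expanded with $K = (1-\xi)/\xi$.

```python
from fractions import Fraction

def expansion_coefficients(xi):
    """Coefficients of K*||v-u_m||^2 - ||v-t_m||^2 + ||t_m-u_m||^2 / 2 after expansion."""
    K = (1 - xi) / xi
    a_coeff = K - 1                                                  # ||v - u_c||^2
    cross = 2 * K * xi - 2 * (1 - xi)                                # <v - u_c, u_c - t_c>
    b_coeff = K * xi ** 2 - (1 - xi) ** 2 + (1 - 2 * xi) ** 2 / 2    # ||t_c - u_c||^2
    return a_coeff, cross, b_coeff

# The value xi = 1/4 is the one used to derive Proposition 1.
a, c, b = expansion_coefficients(Fraction(1, 4))
```

For every $\xi \in (0,1/2)$ the cross term vanishes and the other two coefficients reduce to $(1-2\xi)/\xi$ and $-(1-2\xi)/2$, which gives the stated bound on $L$ once the squared norms are replaced by $2H^2$.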